학술
기타
Best-first search–based approach for mining top-k closed frequent itemsets from uncertain databases
PLOS ONE
조회 0
CC BY
이 매체는 공공·자유 라이선스로 본문을 직접 표시합니다.Figures
Abstract
Uncertain data mining has become critical due to data generated by sensor networks, RFID systems, and data integration platforms. Mining top-k closed frequent itemsets from uncertain databases is particularly challenging because probabilistic support evaluation is expensive and the search space grows exponentially. Most existing methods rely on depth-first search (DFS) traversal, which explores candidates in enumeration order and often discovers high-support patterns late, leading to weak pruning and costly closure verification. This paper proposes TUFCI, a best-first-search-based algorithm for mining top-k closed frequent itemsets from uncertain databases. TUFCI explores candidates in descending order of probabilistic support using a priority queue, enabling early discovery of strong patterns, rapid threshold elevation, and safe early termination. Support-ordered exploration also improves closure checking by prioritizing supersets most likely to violate the closure property, thereby reducing redundant superset examinations. Experimental results demonstrate that TUFCI significantly outperforms DFS-based approaches in runtime and reduces the number of closure checks, especially on dense datasets.
Citation: Le N, Vo H, Nguyen T (2026) Best-first search–based approach for mining top-k closed frequent itemsets from uncertain databases. PLoS One 21(6): e0351951. https://doi.org/10.1371/journal.pone.0351951
Editor: Tai Dinh, The Kyoto College of Graduate Studies for Informatics: Kyoto Joho Daigakuin Daigaku, JAPAN
Received: February 13, 2026; Accepted: June 2, 2026; Published: June 17, 2026
Copyright: © 2026 Le et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data for this study are publicly available from the GitHub repository (https://github.com/DRuanli/NguyenHuyThien_TUFCI).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Frequent itemset mining has been a fundamental task in data mining since the Apriori algorithm was introduced by Agrawal and Srikant [1]. Luna et al. [2] survey more than two decades of progress in this field. Traditional algorithms assume deterministic item occurrences, which is unrealistic for many modern data sources such as sensor networks, data integration systems, and privacy-preserving applications [3]. To address this issue, Aggarwal and Yu [4] established the foundations of uncertain data management.
The exponential growth of data and the combinatorial explosion of candidate patterns make pattern mining increasingly expensive. Closed frequent itemsets provide a compact and lossless representation of frequent patterns [5], while the top-k formulation avoids the difficult task of selecting an appropriate minimum support threshold by allowing users to specify only the desired number of patterns [6]. These properties are particularly important for uncertain databases, where computing probabilistic support is costly and pruning is essential.
Most existing algorithms for top-k closed itemset mining adopt depth-first search (DFS) traversal because of its low memory usage [7,8]. However, DFS explores candidates in enumeration order determined by item indexing, without considering their probabilistic support values [8,9]. As a result, high-support patterns are often discovered late, delaying the increase of the dynamic pruning threshold and weakening pruning effectiveness. Moreover, closure verification—which requires checking whether any direct superset has the same support—becomes expensive under DFS because supersets are examined in arbitrary order rather than by their likelihood of violating the closure property. This leads to redundant superset examinations and unnecessary computation when high-support patterns appear in late-explored branches.
Prior work on uncertain frequent itemset mining has mainly focused on probabilistic support computation and pruning strategies [4,10], including optimizations based on generating functions, upper bounds, and efficient data structures [3,11,12]. In contrast, the impact of search order itself on pruning efficiency and closure verification has received limited attention. Priority-queue-guided search has been explored in high-utility itemset mining (e.g., TKO [13], kHMC [14]) and in deterministic top-k pattern mining [6]. However, these methods operate on utility functions or deterministic support—neither of which involves the probabilistic support computation that dominates cost in uncertain databases. To the best of our knowledge, TUFCI is the first algorithm to combine best-first search with closure checking for top-k frequent closed itemset mining over uncertain data. The key technical distinction is that the generating-function convolution cost ( per candidate, Section “Probabilistic Support Evaluation”) makes traversal order far more consequential than in deterministic or utility-based settings where support or utility is computed in .
This paper proposes a best-first search strategy for mining top-k closed frequent itemsets from uncertain databases. Candidates are explored in descending order of probabilistic support using a priority queue, enabling early discovery of strong patterns, rapid threshold elevation, and safe early termination. Support-ordered exploration also improves closure checking by examining supersets most likely to violate the closure property first, allowing violations to be detected early and reducing redundant superset examinations.
The main contributions of this paper are as follows:
- We propose a best-first search framework for mining top-k closed frequent itemsets from uncertain databases, where candidates are processed in descending order of probabilistic support.
- We introduce a closure verification strategy that exploits support-ordered exploration to reduce redundant superset checks and improve efficiency.
- We develop TUFCI, an algorithm integrating multiple pruning techniques designed specifically for best-first traversal.
- We conduct extensive experiments demonstrating that the proposed approach achieves substantial runtime improvements over DFS-based methods, with results validated across dense and sparse datasets using external baselines and internal ablation studies.
Literature review
Frequent itemset mining in deterministic databases
Frequent itemset mining (FIM) was first introduced by Agrawal et al. [1] through the Apriori algorithm. Han et al. [9] proposed FP-Growth, which avoids candidate generation via a pattern-growth paradigm, and Eclat [11] introduced a vertical-format intersection approach. These methods established the algorithmic foundations of deterministic pattern mining.
Closed frequent itemsets provide a lossless compression of the frequent pattern set [5]. Wang et al. [7] proposed CLOSET+ with pruning optimizations, and the TFP algorithm [6] extended closure-based mining to the top-k setting by dynamically raising the support threshold. These methods rely on depth-first traversal, where exploration order is determined by item indexing rather than support values.
Frequent itemset mining in uncertain databases
Uncertain databases assign existential probabilities to items, making support evaluation probabilistic rather than deterministic. Chui et al. [8] pioneered this area with U-Apriori, and Bernecker et al. [3] formalised probabilistic FIM under possible-world semantics. Generating functions have since become the standard approach for computing probabilistic support distributions [12]. Sun et al. [12] extended closed itemset mining to uncertain databases, and Ahmed et al. [15] explored evolutionary approaches. However, these studies focus on probabilistic modeling and computation efficiency, while the impact of search traversal order on closure verification and dynamic threshold convergence remains largely unexplored.
High-utility and top-k itemset mining
High-utility itemset mining (HUIM) extends FIM by incorporating item quantities and profits [16]. Singh and Biswas [17] survey top-k HUIM algorithms. Tseng et al. [13] proposed TKO as a single-phase approach, while Duong et al. [14] introduced threshold-raising strategies (RIU, CUD, COV). Although utility mining differs from uncertain FIM in its objective function and monotonicity properties, both domains share the challenge of delayed threshold convergence under depth-first traversal, including recent works in the field [18,19].
Search strategies in pattern mining
Priority-queue-guided exploration has been adopted in high-utility mining (TKO [13], kHMC [14]) and deterministic top-k pattern mining [6]. However, uncertain FIM has predominantly relied on level-wise or pattern-growth enumeration [8,12]. The computational bottleneck unique to uncertain databases— polynomial convolution per candidate—is absent in deterministic and utility-based settings, where support or utility is computable in . This cost asymmetry makes traversal order more consequential in the uncertain setting and motivates the best-first framework proposed in this paper.
To sharpen the novelty claim, Table 1 summarises how TUFCI differs from the most closely related prior frameworks along four technical dimensions.
Three distinctions separate TUFCI from prior best-first work. First, TKO and kHMC operate on deterministic utility functions computable in ; the Poisson-binomial convolution in TUFCI makes traversal order far more consequential per candidate, as a single skipped convolution saves quadratically more work. Second, neither TKO nor kHMC incorporates closure checking: they mine top-k itemsets, not top-k closed itemsets, so their priority queues need not interact with a closure verification module. TUFCI’s support-ordered closure verification (Theorem 2) is a novel component that exploits the same priority ordering to reduce superset examinations. Third, TopKPFIM and ITUFP operate on uncertain data but use depth-first traversal and have no early termination guarantee; TUFCI’s Lemma 5 provides a provably safe stopping condition absent in those methods.
In summary, existing work has extensively investigated probabilistic support computation, pruning strategies, and compact pattern representations. Priority-based exploration has been adopted in utility mining but not in uncertain frequent itemset mining, where the dominant cost structure (polynomial convolution) differs fundamentally. The influence of search order on closure verification efficiency and dynamic threshold convergence has received limited attention in the uncertain setting. This gap motivates TUFCI: a best-first search framework that aligns the exploration order with the top-k objective, specifically targeting the high per-candidate cost of probabilistic support evaluation.
Problem statement and related definitions
For clarity, Table 4 summarises the notation used throughout this paper. The algorithm is referred to exclusively as TUFCI (Top-k Uncertain Frequent Closed Itemset mining). We reserve lowercase Greek letters for thresholds and probabilities (, , ), calligraphic letters for collections and data structures (, , ), and adopt two shorthand conventions: for when threshold is clear, and for when the heap context is clear from context.
Uncertain database model
We adopt the tuple-level independence assumption commonly used in probabilistic data mining. Under this assumption, the occurrence of each item in a transaction is modeled as an independent Bernoulli random variable, and item existence events are independent across both items and transactions.
Definition 1 (Itemset). Let denote the universe of items. An itemset X is a non-empty subset of J, i.e., and . Its cardinality is denoted by |X|.
Definition 2 (Uncertain transaction). An uncertain transaction T over J is defined as a set of item–probability pairs
(1)where P(x,T)=p denotes the probability that item x appears in T. Items with P(x,T)=0 are considered absent.
Definition 3 (Uncertain database). An uncertain database is a collection of N uncertain transactions,
(2)For convenience, denotes the existence probability of item x in transaction , and if .
Under the tuple-level independence assumption, for any transaction and any itemset , the probability that X appears in is
(3)Example 1. Consider the uncertain database in Table 2, where J={A,B,C,D,E}.
For itemset X={A,B}, the probability that X appears in transaction T1 is
(4)Since P(A,T4)=0, we obtain
(5)Probabilistic support
The goal of this subsection is to define the probabilistic support , the central quantity used throughout the paper for ranking and pruning. Because item occurrences are uncertain, support is no longer a simple count but a random variable. We build up to the final definition in three steps, each motivated by a concrete need: (1) we define support as a random variable (Definition 4) to capture the inherent uncertainty; (2) we introduce generating functions as the only computationally tractable way to obtain the full distribution of that random variable (Eq. 9–10), since a direct enumeration of all possible outcomes is infeasible; and (3) we define the tail probability (Definition 5) and use it to extract a single, threshold-qualified support value (Definition 6) that behaves analogously to the deterministic support count and admits the same pruning arguments. Example 2 concretizes all three steps on the running database.
Definition 4 (Support). For an itemset X in an uncertain database , the support of X is the random variable
(6)where equals 1 if and 0 otherwise.
Under the tuple-level independence assumption each indicator is Bernoulli with success probability
(7)Hence follows a Poisson-binomial distribution.
Computing requires O(|X|) multiplications since
(8)Each transaction independently “contains” itemset X with probability and “excludes” it with probability . The total support of X is therefore the sum of these independent Bernoulli outcomes.
Unlike the standard Binomial distribution, the Poisson-binomial distribution has a distinct success probability for each transaction, so its PMF has no closed-form expression. Computing it directly by enumerating all possible transaction subsets is exponentially expensive and infeasible for any realistic database. Generating functions provide an efficient alternative: by encoding each transaction’s contribution as a degree-1 polynomial and multiplying all N polynomials together, the coefficient of in the resulting product gives exactly .
To obtain the full distribution of efficiently we use generating functions. For each transaction (with ) define the polynomial
(9)and the probability generating function
(10)The coefficient of in G(z) equals (see [3]).
Having computed the distribution for each s, a natural question is: how likely is X to appear in at least s transactions? The cumulative (tail) probability answers this question. Since each term in the sum is non-negative, this function is monotonically non-increasing in s—higher support thresholds are harder to meet.
Definition 5 (Frequency function). For the frequency function of X is
(11)The probabilistic support answers: what is the largest support level s that itemset X achieves with confidence at least ? Because the frequency function is non-increasing, this maximum is well-defined and can be found efficiently via binary search on the frequency array.
Definition 6 (Probabilistic support). Let be a probability threshold. The probabilistic support of X is the largest integer s such that :
(12)If no such s exists then .
Example 2. Continuing from Example 1, let X={A}. The existence-probability vector is
so is Poisson-binomial. Table 3 gives the PMF and tail probabilities.
With we find and F{A}(4)=0.4844 < 0.7, hence
Frequent and closed itemsets
Definition 7 (Frequent itemset). Given a probability threshold and a minimum support count , an itemset X is said to be -frequent if
(13)The set of all -frequent itemsets in an uncertain database is denoted by . In top-k mining, is not fixed by the user; instead, it is supplied by the dynamic threshold derived from the result heap (Definition 11).
Intuitively, a closed itemset is one that cannot be “extended” without losing support. If adding any single item to X always reduces its support, then X captures maximal information—no larger itemset carries the same statistical signal.
Definition 8 (Closed frequent itemset). A frequent itemset is said to be closed if there exists no proper superset such that
(14)The set of all closed frequent itemsets is denoted by .
Theorem 1 (Closure checking via immediate extensions). An itemset X is closed if and only if, for all items , the probabilistic support strictly decreases when extending X by e, i.e.,
(15)Proof. By the antimonotonicity of probabilistic support, adding an item to an itemset cannot increase its support. Therefore, if the probabilistic support strictly decreases for all immediate supersets of the form with , then no proper superset of X can have the same probabilistic support as X, and X is closed.
Conversely, if there exists an item such that
then X has a proper superset with equal probabilistic support and is therefore not closed. This completes the proof. □
Example 3. Consider the uncertain database in Table 2 with probability threshold . Let X={A} be an itemset with
To determine whether X is closed, we examine all immediate extensions obtained by adding one item from . From the database we obtain
Since the probabilistic support strictly decreases for all immediate extensions of X, it follows from Theorem 1 that the itemset {A} is closed.
Search space and enumeration
Definition 9 (Set-enumeration tree). The search space of itemsets is organized as a set-enumeration tree. Given a total order on the item universe J, the children of a node representing an itemset X are all itemsets of the form
The root of the tree corresponds to the empty set.
Depth-first search (DFS) explores the set-enumeration tree by fully expanding each branch before backtracking. Its traversal order is determined solely by the predefined order and is independent of itemset support values. Consequently, itemsets with high probabilistic support may be discovered late in the search process, which delays updates of the dynamic threshold and reduces pruning effectiveness.
In contrast, best-first search maintains a priority queue of candidate itemsets ordered by probabilistic support and always expands the currently most promising candidate. This strategy favors the early discovery of high-support itemsets and enables closure checking to be performed first on the most relevant supersets.
Top-k mining formulation
Definition 10 (Top-k closed frequent itemsets). Let denote the set of all closed frequent itemsets in an uncertain database . The set of top-k closed frequent itemsets is a subset such that and
(16)That is, contains the k closed frequent itemsets with the largest probabilistic supports.
Tie handling.
When more than k closed frequent itemsets share the probabilistic support of the k-th ranked itemset, Eq. (16) alone does not single out a unique set. TUFCI returns exactly k itemsets and resolves such boundary ties deterministically, using the same Candidate Priority Order applied during mining (Definition 13): among itemsets of equal probabilistic support, those with larger tail probability are ranked higher, and any remaining ties are broken in favour of smaller cardinality. The returned set is therefore well-defined. When fewer than k closed frequent itemsets exist for the given constraints, TUFCI returns all of them.
Definition 11 (Dynamic threshold). During the mining process, a dynamic threshold is maintained. Initially, . Once k closed frequent itemsets have been identified, is updated to the probabilistic support of the current k-th ranked itemset. Any candidate itemset X satisfying
(17)can be safely pruned from further consideration.
Problem statement
Definition 12 (Problem: Top-k closed frequent itemset mining from uncertain databases). Given an uncertain transaction database , a probability threshold , and an integer , return the set of top-k closed frequent itemsets, where is defined as in Definition 10 (i.e., the k closed frequent itemsets with largest probabilistic supports ).
Practical challenges
- Expensive support computation. Evaluating requires O(|X|) multiplications per transaction and computing the full distribution of (or its tail probabilities) involves convolution operations (generating functions / Poisson-binomial computations).
- Late threshold elevation under DFS. Depth-first traversal may discover high-support closed itemsets late, delaying updates of the dynamic threshold and reducing pruning effectiveness.
- Costly closure verification. Closure checks require examining immediate supersets; under arbitrary exploration order (e.g., DFS) these checks may be repeated unnecessarily.
- Combinatorial explosion. The search space grows exponentially with the number of items |J|, creating scalability and memory-pressure issues for candidate management.
Key properties
Lemma 1 (Antimonotonicity). For any two itemsets , the probabilistic support satisfies
(18)Proof. For each transaction , define and . Since and all existence probabilities lie in [0,1], we have for every t.
We establish the result via an explicit coupling. Let be independent Uniform(0, 1) random variables, and define
By construction, each and , and variables are independent across transactions. Since , we have almost surely for every t.
Let and . The pointwise inequality yields almost surely. Therefore, for every :
By Definition 6, . Since the tail probabilities of are pointwise dominated by those of , the maximum s satisfying the -threshold for Y cannot exceed that for X. Hence . □
Lemma 2 (Safe pruning). Let be the current dynamic threshold. If an itemset X satisfies
(19)then X and all its supersets can be safely pruned from further consideration.
Proof. By Lemma 1, for any superset ,
Hence, neither X nor any of its supersets can reach the dynamic threshold and thus cannot belong to the top-k closed frequent itemsets. □
Lemma 3 (Early termination under best-first search). Assume that k closed frequent itemsets have already been found and that is the probabilistic support of the current k-th ranked itemset. If the best-first candidate X extracted from the priority queue satisfies
(20)then the search process can be terminated.
Proof. All remaining candidates in the priority queue have probabilistic support no greater than that of X, i.e.,
By Lemma 2, none of these candidates nor their supersets can be included in the top-k result set. Therefore, no further itemsets can improve the current solution, and the search can safely terminate. □
Lemma 3 highlights a fundamental advantage of best-first search: it provides a sound and natural termination condition that is not available under depth-first traversal strategies (Table 4).
Proposed methods
This section presents TUFCI, a best-first search algorithm for mining top-k closed frequent itemsets from uncertain databases. We first describe the overall framework, then detail the data structures, support computation via generating functions, pruning strategies, and the novel support-ordered closure verification technique. Finally, we present the complete algorithm with theoretical analysis.
Overview of the TUFCI framework
TUFCI addresses the limitations of depth-first search (DFS) approaches by exploring candidates in descending order of probabilistic support using a priority queue. This best-first strategy offers three key advantages over DFS-based methods:
- Early discovery of high-support patterns: By always processing the candidate with highest support first, TUFCI discovers strong patterns early, enabling rapid threshold elevation.
- Efficient closure verification: When checking whether an itemset is closed, TUFCI examines supersets in descending support order, detecting closure violations early and avoiding unnecessary computations.
- Safe early termination: Once k closed itemsets have been found and the best remaining candidate has support below the current threshold, TUFCI can safely terminate without exploring the remaining search space.
Fig 1 illustrates the complete architecture, highlighting the interaction between the vertical database, priority queue, and the closure verification module.
The algorithm operates in three phases: (1) computing frequent 1-itemsets from the vertical database, (2) initializing data structures and seeding the priority queue, and (3) best-first mining with support-ordered closure verification. The priority queue ensures candidates are processed in descending support order, enabling early termination when the best candidate falls below the dynamic threshold.
Algorithm 1 outlines the high-level execution flow of the framework.
Algorithm 1 TUFCI Framework Overview
Require: Uncertain database , parameter k
Ensure: Top-k closed frequent itemsets
1: Phase 1: Preprocessing
2: Scan to compute vertical database; compute for all 1-itemsets
3: Sort all 1-itemsets by support descending
4: Phase 2: Initialization
5: Initialize candidate structure and result set
6: Seed with frequent 1-itemsets
7: Initialize dynamic threshold
8: Phase 3: Mining
9: while is not empty do
10:
11: if then break
12: Process X and generate extensions
13: end while
14: return
As shown in Algorithm 1, the framework operates in three distinct phases:
Phase 1: Preprocessing (Lines 1–3). The algorithm scans the uncertain database once to build the vertical representation and computes for every item i in the item universe J. No items are filtered at this stage because, in top-k mining, the dynamic threshold starts at zero and only rises as the result heap fills (see detailed Phase 1 in Section “The TUFCI Algorithm” and Algorithm 4). Items registered here are sorted by descending probabilistic support to produce .
Phase 2: Initialization (Lines 4–7). Data structures are initialized. The result heap maintains the current top-k patterns, while the Priority Queue orders candidates. The queue is seeded with frequent 1-itemsets from , setting the stage for the traversal.
Phase 3: Best-First Mining (Lines 8–14). This is the core iterative process. In each step, TUFCI extracts the highest-support candidate X. Crucially, if support(X) is lower than the current dynamic threshold , the algorithm terminates immediately (Line 11), as no remaining candidate can qualify. Otherwise, closure is verified, results are updated, and valid extensions are generated for future processing.
Core mining components
The TUFCI algorithm is built upon three core components that enable best-first traversal of the itemset search space: a candidate priority ordering, a dynamic pruning threshold, and a probabilistic vertical data representation. These components are specifically designed to address limitations inherent to depth-first traversal, where candidates are explored according to a fixed enumeration order that is independent of their probabilistic support.
Candidate priority ordering
Depth-first search expands itemsets according to a predefined lexicographic order, which does not reflect their probabilistic relevance to the top-k objective. As a consequence, itemsets with high probabilistic support may be discovered late in the traversal, delaying threshold updates and weakening pruning effectiveness. In contrast, best-first traversal requires a strict ordering that prioritizes candidates with larger support values.
Definition 13 (Candidate Priority Order). For two candidate itemsets X and Y, (X has higher priority than Y) if:
- ; or
- and ; or
- , , and |X| < |Y|.
Tie-breaking rationale.
When two candidates share the same support value, the priority queue resolves ties by (i) preferring the higher tail-probability value , then (ii) preferring smaller itemsets — matching the order in Definition 13.
The probability-descending rule (rule 2) is applied first among equal-support candidates: a higher value indicates the itemset is more confidently at its support level, making it a more reliable candidate for the result heap. The size-ascending rule (rule 3) then breaks remaining ties: a smaller itemset X admits more canonical extensions than a larger Y with the same support and probability (since ), causing to rise sooner and enabling more aggressive pruning.
Empirical sensitivity.
To confirm that the tie-breaking policy does not materially affect our conclusions, we ran a supplementary experiment on the Chess dataset (k = 30) with three policies: (a) our default (size-ascending, probability-descending), (b) size-descending, probability-ascending, and (c) random. All three policies produced identical Top-k result sets (correctness is guaranteed regardless of tie-breaking). Runtime varied by less than 3% across policies, confirming that the primary performance driver is the support-descending exploration order, not the tie-breaking rule.
This ordering ensures that the search always expands the node with the highest potential first, which is the necessary condition for the Safe Early Termination property.
Dynamic pruning threshold
Unlike traditional mining where the threshold is static, TUFCI maintains a dynamic threshold derived from the current set of top-k closed patterns found so far.
Definition 14 (Dynamic support threshold). Let denote the Top-k min-heap maintained during mining. The dynamic support threshold is
(21)All pruning conditions in Algorithms 4–6 reference rather than a fixed minimum support. Since is monotonically non-decreasing over the course of mining (Proposition 1), pruning becomes progressively more aggressive.
Property 1 (Monotonicity of ) The dynamic threshold is monotonically non-decreasing over the course of mining.
Proof. The threshold can change only upon insertion into . While , insertions do not change (which remains 0). When , an insertion occurs only if the new pattern X satisfies , and the evicted pattern is the current minimum of . The new minimum is at least as large as the old minimum, because the evicted element (the previous minimum) has been replaced by a strictly superior element. Hence never decreases. □
The threshold is a lower bound on the support of any itemset that may belong to the top-k result set. By the anti-monotonicity of probabilistic support, any candidate X such that and all of its supersets can be pruned. Under best-first traversal, high-support itemsets are examined early, causing to increase sooner than under depth-first traversal and strengthening pruning effectiveness.
Probabilistic tidset
To support the rapid support calculations required by the priority queue, we employ a vertical data representation that encapsulates the uncertainty of item occurrences.
Definition 15 (Probabilistic Tidset). For an itemset X, the probabilistic tidset is defined as:
(22)where assumes independence between items.
This representation enables efficient support computation and candidate extension via set intersections.
Example 4. Consider the uncertain database in Table 2. The corresponding vertical database maps each item to its probabilistic tidset, as shown in Table 5.
As illustrated in Table 5, the vertical database V is defined by for each item .
Priority queue and result heap
The priority queue maintains candidate itemsets ordered according to the Candidate Priority Order (Definition 13). This ordering guarantees that, at each iteration, the candidate with the largest probabilistic support is selected for expansion.
The result set is implemented as a min-heap of fixed size k, ordered by probabilistic support. This structure provides constant-time access to the smallest support value among the current top-k patterns. When , this minimum value defines the dynamic threshold , which is used for pruning and early termination.
Probabilistic support evaluation
In TUFCI, the support of a candidate itemset X is a random variable representing the number of transactions in which X occurs. Assuming independence of item existence events across and within transactions, each transaction contributes a Bernoulli trial with probability , where . Under this assumption, follows a Poisson-binomial distribution.
Let denote the current dynamic threshold. To determine whether X can enter the top-k heap, we compute
where .
TUFCI computes the distribution using a Direct Convolution dynamic programming method. For the tidset probabilities (one entry per transaction containing X), let denote the probability that the support equals s after processing the first i tidset entries. The recurrence is:
(23)with D0[0]=1 and D0[s]=0 for s>0.
Algorithm 2 computes this distribution in time, where is the number of transactions containing X (the tidset size), not the total database size N. This distinction is important: anti-monotonicity guarantees that shrinks as itemsets grow longer, so the quadratic cost is paid mainly for short, high-frequency candidates. In the worst case—a single item appearing in all N transactions— and the cost is O(N2), which would be prohibitive. Three mechanisms bound in practice: (1) Strategy P7 prunes candidates whose tidset cardinality falls below the dynamic threshold ; (2) longer itemsets have smaller tidsets by anti-monotonicity; and (3) alternative convolution algorithms (e.g., FFT-based or divide-and-conquer) can reduce the per-candidate cost to when is large, though the current implementation uses direct convolution throughout. Exact support values are required to correctly order candidates in the priority queue and to enable effective pruning and early termination under the best-first search strategy.
Algorithm 2 DirectConvolutionSupport
Require: Tidset probabilities of X:
Ensure: Support distribution vector D (where )
1: Initialize array D of size with zeros
2: (Probability of support 0 is initially 1)
3: for all do
4: for j from down to 1 do
5:
6: end for
7:
8: end for
9: return D
Support-ordered closure verification
Closure verification determines whether an itemset X is closed by checking if any of its immediate supersets have the same probabilistic support. In depth-first traversal, extensions are typically examined in a fixed enumeration order that is independent of their support values, which may result in redundant support computations before a closure violation is detected. TUFCI instead evaluates 1-extensions in non-increasing order of support and terminates the check as soon as no remaining extension can match the support of X.
The method relies on the anti-monotonicity of probabilistic support with respect to itemset inclusion and on ordering extensions by their support values.
Theorem 2 (Support-ordered closure verification). Let X be an itemset with support , and let
be the set of its 1-extensions sorted in non-increasing order of support. If for some index j, then for all it holds that
Proof. Because the extensions are sorted in non-increasing order of support, for any we have
If , then for all . Hence, once an extension with support strictly smaller than is encountered, no subsequent extension can have support equal to , and the closure check can terminate. □
Lemma 4 (Closure completeness under sorted iteration). Let be the frequent items sorted in non-increasing order of singleton support. For candidate X with support , if for some index j, then for all :
Hence no item at position can produce a closure violation ().
Proof. By anti-monotonicity (Lemma 1), . Since is sorted in non-increasing order, for all . The strict inequality precludes equality with .
Remark on sort stability.
The correctness of the closure-done flag requires only that the ordering is non-increasing—not a stable sort. When multiple items share the same singleton support equal to , all such items are visited before the flag is set, because the flag triggers only on a strict decrease below . Any permutation among items with identical support is therefore harmless.
Algorithm 3 implements the combined closure verification and extension generation procedure. It iterates through frequent items in non-increasing order of singleton support, employing two independent control-flow mechanisms:
- P3 termination (line 8): The loop terminates when , because by anti-monotonicity , so no remaining item can produce a viable extension.
- Closure-done flag (line 11): When , the flag disables closure checking for subsequent items while the loop continues to generate extensions. By Lemma 4, no item beyond this point can produce a closure violation.
These two mechanisms are logically independent: P3 terminates the loop entirely (no further extensions are possible), while the closure-done flag only disables one type of check (closure violations are impossible, but extensions may still be viable).
Algorithm 3 CheckClosureAndExtend
Require: Itemset X, dynamic threshold
Ensure:
1:
2: {Flag: all remaining items have }
3:
4: for each item i in in support-descending order do
5: if then
6: continue
7: end if
8: if then
9: break {P3: anti-monotonicity ⇒ }
10: end if
11: if and then
12: (No further closure violations possible (Lemma 4))
13: end if
14:
15:
16: if then
17: continue
18: end if
19: Compute via tidset intersection and generating functions
20: if needClosure and then
21:
22: end if
23: if needExtend and then
24:
25: end if
26: end for
27: return
Processing extensions in support order ensures that if a closure-violating superset exists, it will be encountered before any lower-support extension. This reduces the number of support evaluations required for closure verification. Because best-first traversal maintains a global ordering of candidates by support, it increases the likelihood that high-support extensions are generated and checked earlier than under depth-first traversal, which explores candidates according to a fixed structural order. Consequently, support-ordered closure verification complements the best-first search strategy and contributes to reducing redundant computations in TUFCI.
Best-first search with early termination
Rationale for the search strategy
Traditional itemset mining algorithms typically employ Depth-First Search (DFS) to traverse the search space. While DFS uses memory efficiently, it traverses candidates in a predetermined structural order that is independent of their probabilistic support values. As a result, itemsets with high probabilistic support—whose early discovery is critical for elevating the dynamic threshold —may be encountered late in the traversal. This delay can keep low over much of the search and reduce the effectiveness of pruning.
In contrast, TUFCI adopts a Best-First Search (BestFS) strategy based on the priority queue. By always selecting the candidate with the highest probabilistic support for expansion, the algorithm tends to discover high-support itemsets early. Early updates to the dynamic threshold strengthen pruning conditions and reduce exploration of low-support regions of the search space.
Early termination condition
An important property of the BestFS strategy is that it enables safe early termination of the search based on the current dynamic threshold.
Lemma 5 (Early termination under best-first search). In the best-first variant (Algorithm 6), if the candidate X returned by satisfies , then all remaining candidates also satisfy , and mining may safely terminate.
Proof. The proof rests on three facts, each independently verifiable:
- (i) Support immutability. Each candidate’s probabilistic support is computed once at creation time and stored as an immutable key. The heap key therefore never changes after insertion; the priority queue is a static-key structure from the perspective of any individual candidate.
- (ii) Max-heap invariant. The priority queue is implemented as a binary max-heap keyed by probabilistic support. The standard binary heap invariant—a property of the data structure independent of the algorithm’s correctness argument—guarantees that returns the element with the largest key in time and restores the invariant after extraction. Therefore, after extracting , every remaining element satisfies as a direct consequence of the heap invariant, not as an assumption.
- (iii) Extension pruning. Given (ii), if , then for every remaining . Moreover, for any extension , anti-monotonicity (Lemma 1) gives , so no extension of any remaining candidate can enter . Combined with the monotonicity of (Property 1), the current Top-k result is final.
Remark on maintenance cost. The correctness argument above is independent of the computational cost of maintaining the heap ordering. The cost of each Insert and ExtractMax operation is , and the total overhead across the entire mining run is accounted for in the Phase 3 time complexity bound (Section). Crucially, this cost does not affect the validity of the termination condition: once X is extracted and found to satisfy , termination is correct regardless of how many heap operations were performed to reach that point.
Fig 2 illustrates this difference in traversal behavior. In the DFS case (A), exploration may proceed deep into branches with low support before reaching high-support itemsets. In the Best-First Search case (B), the highest-support itemsets are visited first, and can be updated early, enabling the remaining low-support candidates to be pruned without further expansion.
(A) Under DFS, exploration proceeds according to structural order. (B) Under best-first search, candidates with higher support are visited first, which can lead to earlier threshold updates and pruning.
The TUFCI algorithm
The TUFCI framework is architected into three distinct phases to efficiently handle the complexity of uncertain data mining. By decoupling preprocessing, initialization, and the core mining loop, the algorithm maximizes modularity and allows for specific optimizations at each stage.
Phase 1: Preprocessing and singleton support computation
The first phase (Algorithm 4) performs a single linear scan of the database to construct the vertical representation V: for each item i, the tidset is built. This scan registers every item into the item universe J, including items that appear only in late transactions. Phase 1 then iterates over J (not over transactions) and computes for each using V(i). No item can be missed regardless of its position in the database. The per-item computation cost is O(|V(i)|2) via dynamic-programming convolution over only the |V(i)| transactions containing item i.
Crucially, Phase 1 does not filter items by a minimum support threshold, because in top-k mining the dynamic threshold starts at zero and is derived from the Top-k heap state (Definition 14). Filtering occurs in Phase 2 as the heap fills.
Addressing the late-appearing items concern.
A natural question arises: does TUFCI’s single-pass preprocessing miss items that appear only in late transactions? The answer is no, by construction. The vertical database build (Algorithm 4, Line 1) scans all N transactions sequentially, registering every observed item into item universe J regardless of transaction index. Consider a concrete example:
Example (Late-appearing item): Suppose a database contains 5 transactions: T1 = {A, B}, T2 = {B, C}, T3 = {A, C}, T4 = {B, D}, T5 = {E}. Item E appears only in the final transaction. During the vertical scan:
- Transactions are processed, registering A, B, C, D into J and populating their tidsets.
- Transaction T5 is processed, registering E into J and creating .
- After the scan completes, J = {A, B, C, D, E}, and Phase 1 computes using its tidset like any other item.
Mathematically, the vertical database construction guarantees that for every item i occurring in at least one transaction , we have after the scan completes. The subsequent support computation (Algorithm 4, Lines 3–7) iterates over J, not over a pre-filtered set based on position. Therefore, late-appearing items are never excluded by position bias; they are treated identically to early-appearing items. The only reason an item might be filtered later (Phase 2 or Phase 3) is if its probabilistic support falls below the dynamically rising threshold —a correctness condition, not a positional artifact.
Preprocessing step.
Before mining begins, the algorithm performs a single linear scan of the entire database to construct the vertical representation V. During this scan, every item encountered is registered in the item universe J, and for each item i, the tidset is assembled. This preprocessing step has time complexity , where is the number of transactions and is the average transaction width. Crucially, no item is missed regardless of its position in the database—an item appearing only in the final transaction is registered and its tidset is constructed during this single pass, as demonstrated in the example above.
Algorithm 4 Phase 1: Preprocessing
Require: Uncertain database , probability threshold
Ensure: Sorted list of all 1-itemsets , vertical database V
1: {Single linear scan; registers all items into item universe J}
2: empty list
3: for all item do
4:
5: if is empty then
6: continue
7: end if
8:
9:
10: end for
11: Sort by support descending
12: return
Phase 2: Initialization
In the second phase (Algorithm 5), we initialize the core data structures: the Top-k Heap () to store the current best closed itemsets, and the Max-Priority Queue () to manage the search frontier.
The Priority Queue is seeded with the frequent 1-itemsets identified in Phase 1 (line 10 of Algorithm 5); these are the initial candidates for expansion. In addition, Phase 2 pre-computes and caches 2-itemset supports (line 11) in a separate hash table that supports the P4 upper-bound pruning strategy; the 2-itemset cache is an auxiliary lookup structure and its entries are not inserted into . A preliminary closure check is performed on each 1-itemset: if it is subsumed by a superset with equal support, it is not added to as a closed candidate, though it remains available in for extension generation.
Algorithm 5 Phase 2: Initialization
Require: Sorted frequent 1-itemsets , integer k, threshold
Ensure: Priority Queue , Result Heap , 2-itemset cache
1: { initially}
2:
3: (Auxiliary 2-itemset support cache for P4)
4: for all do
5: if is full and then
6: break {P1: remaining singletons cannot enter }
7: end if
8:
9: if isClosed then
10: { may increase}
11: end if
12: {Seed with 1-itemset for later expansion}
13: end for
14: Pre-compute and store in for all pairs with
15: return
Phase 3: Best-first mining process
The core mining loop is presented in Algorithm 6. This phase iteratively extracts the candidate with the highest support from and performs three critical operations:
- Early Termination: If the support of the extracted candidate is less than the current dynamic threshold , the algorithm terminates immediately. Since is sorted, no remaining candidate can possibly qualify for the top-k set (Lemma 5).
- Expansion & Closure: The candidate is expanded to generate supersets. We simultaneously check if the candidate is closed.
- Result Update: If closed, it is added to . If becomes full, is raised to the support of the k-th best pattern, effectively pruning the remaining search space.
Algorithm 6 Phase 3: Best-First Mining
Require: Queue , Heap , threshold
Ensure: Final Top-k closed frequent itemsets
1: while is not empty do
2:
3: if then
4: break (P2: early termination (Lemma 5))
5: end if
6:
7: if isClosed then
7: { may increase}
8: end if
9: for all do
10: if then
11:
12: end if
13: end for
14: end while
15: return
Execution walkthrough with example
To clarify the interaction between these phases, consider the uncertain database from Table 2 with threshold and k = 2.
Phase 1 (Preprocessing): Items are evaluated. Suppose the computed probabilistic supports are: A = 3, C = 3, B = 2, D = 2. Sorted .
Phase 2 (Initialization). Priority Queue is seeded with the frequent 1-itemsets.
(24)Result heap is empty;
Phase 3 (Mining).
- Iteration 1: Extract A (support 3). Check extensions (AB, AC). Assume extensions have support 2. Since sup(A)> sup(extensions), A is Closed.
- Add A to . Top-k: {A}.
- Add AB, AC to .
- Iteration 2: Extract C (support 3). Assume C is closed.
- Add C to . Top-k: {A, C}.
- Update : is full (k = 2). Minimum support in is 3. New .
- Iteration 3: Extract next best candidate (e.g., B or AB, which have support 2).
- Termination Check: Current candidate support (2) (3).
- Action: Algorithm breaks immediately.
This demonstrates how TUFCI avoids processing low-support patterns () entirely once the Top-k buffer is filled with high-quality patterns.
Pruning strategies
To address the exponential search space inherent in uncertain databases, TUFCI integrates seven distinct pruning strategies (P1–P7) organized into four semantic groups. We note that strategies P2 (dynamic-threshold pruning), P3 (item-level filtering), and P5 (upper-bound filtering, which applies the P4 test as a pre-convolution filter) are established techniques in deterministic and utility-based pattern mining [6,13,14], while P4 (subset-based upper bound) generalises the pairwise upper bound used in top-k HUIM [13,14] to the probabilistic-support setting. The novel contributions of TUFCI’s pruning framework are P1 (safe early termination unique to best-first traversal under closure verification), P6–P7 (tidset-cardinality bounds for skipping convolution before its cost is paid), and—most importantly—the integration of all seven strategies under support-ordered exploration, which makes their joint effect super-additive (Exp 2).
- G1 (Frontier Termination): P1 (Early Termination), P2 (Dynamic Threshold Pruning) — control global search space by raising and enforcing the dynamic threshold .
- G2 (Item Admissibility): P3 (Item-Level Filtering) — preprocesses the database to remove infrequent items.
- G3 (Upper Bound Pruning): P4 (Subset-Based Upper Bound), P5 (Upper Bound Filtering) — exploit anti-monotonicity and cached 2-itemset supports to avoid expensive convolution.
- G4 (Tidset Optimization): P6 (Tidset Intersection Look-ahead), P7 (Tidset-Based Early Closure Detection) — leverage tidset cardinality bounds to skip convolution when possible.
These strategies are applied hierarchically, prioritizing low-cost checks (O(1) or O(N)) to eliminate unpromising candidates before invoking the computationally expensive convolution algorithm (, where ).
Strategy 1: Early termination (P1)
This strategy, unique to the best-first approach, enables safe termination of the entire search once the priority queue head falls below the dynamic threshold. Experimental results (Exp 4) show it is the most impactful single pruning component, particularly on dense datasets where the threshold rises rapidly after discovering high-support patterns early.
(25)Strategy 2: Dynamic threshold pruning (P2)
As the mining progresses and k patterns are discovered, the internal threshold rises. By Lemma 2, any candidate X with and all its supersets can be safely pruned, since no descendant can reach the threshold.
Strategy 3: Item-level filtering (P3)
In Phase 1, once the dynamic threshold has been established from the initial closed 1-itemsets, any item i with can be permanently removed, since by anti-monotonicity no itemset containing i can exceed .
(26)This strategy effectively reduces the width of the lattice and provides consistent performance gains across dense datasets.
Strategy 4: Subset-based upper bound (P4)
When extending candidate X with item i, the basic anti-monotonicity bound gives . Strategy P4 tightens this bound by exploiting cached 2-itemset supports from Phase 2.
Property 2 (Subset upper bound) For any itemset X and item :
(27)Proof. For each , , so by Lemma 1. Taking the minimum over all yields the bound. □
P4 iterates over items , looks up each canonical 2-itemset in the cache, and tightens the upper bound. Each lookup is O(1) expected time (hash-map access). If the tightened bound falls below , the extension is pruned, saving one convolution.
Strategy 5: Upper bound filtering (P5)
After P4 computes the tightened upper bound ub for extension , P5 compares it against the current threshold:
(28)This is an O(1) check that prevents the algorithm from entering the expensive tidset-intersection and convolution stages for extensions that cannot enter the Top-k heap.
Strategy 6: Tidset intersection look-ahead (P6)
When expanding a candidate X with item i, we compute the size of the intersection of their tidsets before generating the new node.
(29)While highly effective for sparse data, this strategy is applied conditionally, as the intersection overhead can outweigh the benefits in very dense datasets.
Strategy 7: Tidset-based early closure detection (P7)
After computing the tidset intersection , the cardinality is available before the expensive convolution step.
Property 3 (Tidset cardinality bound) For any itemset X, .
Proof. is a sum of Bernoulli variables, so almost surely. For any , . □
P7 exploits this bound in two logically distinct ways, each targeting a different expensive operation:
- Skip convolution entirely (extension pruning): If , then by Property 3, , so the extension can never enter the heap. If no closure check is needed for this item (either closureDone is already set, or making it non-canonical), the convolution is skipped entirely.
- Clear closure-check flag (closure pruning): If , then , so a closure violation () is impossible. The closure-check flag is cleared, saving the equality comparison. If the extension is also not needed (), convolution is skipped; otherwise it proceeds only to compute the extension support value.
Example 1 (Branch 2 — closure pruning).
Let X = {A, B} with and . For item C with , we compute . Since 15 ≤ 17 (extension cannot be pruned by Branch 1), but (Branch 2 fires), Property 3 guarantees , so no closure violation is possible and the closure-check flag is cleared. If (canonical extension), convolution still runs to obtain the support value for heap insertion; the savings are in avoiding the closure equality test.
Example 2 (Branch 1 — skip convolution entirely).
Using the same X = {A, B} with and . Now consider item E with . We compute . Since (Branch 1 fires), the extension {A,B,E} can never reach the threshold regardless of its exact support. Both the closure check and the O(m2) convolution are skipped entirely. This is the most aggressive saving: one tidset intersection () replaces what would otherwise be a full convolution ().
Theorem 3 (Pruning Correctness). The pruning strategies P1–P7 do not eliminate any true top-k closed frequent itemset.
Proof. We treat each strategy in turn. Throughout, let denote the dynamic threshold.
P1 (early termination). When the head of extracted by has , Lemma 3 guarantees that all remaining candidates and their supersets have support and thus cannot enter .
P2 (dynamic threshold pruning). If , Lemma 2 ensures that no superset of X has support . Pruning X and its supersets is therefore safe.
P3 (item-level filtering). An item i with cannot appear in any itemset with support by Lemma 1, so removing i from the item universe J cannot eliminate any top-k candidate.
P4 (subset-based upper bound). For extension , Property 2 states . If this upper bound is , then , so cannot enter and may be pruned.
P5 (upper bound filtering). P5 simply applies the test to the upper bound ub computed by P4 (or any tighter upper bound). Correctness follows from P4.
P6 (tidset intersection look-ahead). For any itemset Z, Property 3 gives . If , then , so the extension cannot enter .
P7 (tidset-based early closure detection), Branch 1 (skip convolution entirely) If , by Property 3 the extension fails the threshold; skipping convolution is equivalent to a P6-style prune.
P7, Branch 2 (clear closure-check flag). If , by Property 3 we have , which precludes a closure violation . Clearing the closure-check flag therefore omits only equality tests that are already guaranteed to fail; no genuine closure violation is missed.
Combining the per-strategy arguments, every pruned itemset is either provably outside the top-k frontier (P1–P6 and P7 Branch 1) or is irrelevant to closure determination (P7 Branch 2). Therefore no true top-k closed frequent itemset is eliminated (Table 6).
Complexity analysis
This subsection provides formal time and space complexity bounds for TUFCI. Throughout, denotes tidset cardinality, not database size N. We emphasise at the outset that TUFCI does not improve the worst-case asymptotic complexity of top-k closed mining: without pruning it processes the same O(2|J|) candidates as a depth-first baseline (see “Worst-Case Bound” below). Its contribution is to practical, average-case efficiency—reducing the number of candidates actually processed, , through support-ordered exploration and safe early termination.
Time complexity
Phase 1: Preprocessing.
The single linear database scan constructs the vertical representation in time, where is the number of transactions and is the average transaction width. Computing probabilistic support for each 1-itemset requires polynomial convolution over its tidset: for item i with tidset cardinality , the direct convolution method (Section) costs . Summing over all items in item universe J:
(30)In the worst case where every item appears in all transactions ( for all i), this becomes .
Phase 2: Initialization.
Inserting frequent 1-itemsets into the priority queue requires time (min-heap operations for the result heap ). Closure checks for 1-itemsets are in the worst case (checking all pairs). Generating and caching 2-itemsets involves pairs, each requiring tidset intersection (O(N)) and convolution ( where is the intersection size). Total:
(31)where is the average tidset size of 2-itemsets.
Phase 3: Best-First Mining.
Let denote the number of candidates extracted from the priority queue during mining (a dataset-dependent value bounded by the search space size). For each candidate X with tidset cardinality :
- Priority queue operations: per pop and push. The max queue size is bounded by the width of the itemset lattice, typically O(|J|2) after pruning. Total PQ cost: .
- Extension generation: For each candidate X, generating extensions involves iterating over O(|J|) candidate items (reduced by P3 filtering). For each extension item i:
- Tidset intersection: worst-case
- Probabilistic support convolution: , where
- Closure checking: For each candidate entering the result heap, closure verification examines O(|J|) potential supersets (Lemma 4). Each closure check may require support comparison, which is O(1) if support is cached or if recomputation is needed. With caching (our implementation), closure cost is O(|J|) per candidate.
Combining these components and noting that pruning strategies (P1–P7) reduce significantly:
(32)where is the average tidset cardinality of candidates processed.
This expression makes the best-first trade-off explicit. The term is the priority-queue maintenance overhead that a stack-based DFS (with O(1) push/pop) does not incur. TUFCI pays it deliberately in order to keep small: by expanding candidates in descending support order, the threshold rises early, so the dominant per-candidate work—the support convolutions and the O(|J|) closure checks—is never paid on the many low-support candidates that a DFS would expand before its threshold rises. In short, best-first trades a logarithmic queue cost for a reduction in the number of expensive support and closure computations.
Worst-case bound.
In the absolute worst case without pruning, (exponential in item-universe size) and . However, this scenario is unrealistic in practice: pruning strategies (especially P1, P2, P4, P5) reduce by multiple orders of magnitude. Empirically, grows sub-exponentially with k (Exp 2 and Exp 3, Fig 4 and Fig 5, show that TUFCI processes 10–100× fewer candidates than naive DFS on dense datasets).
Space complexity
Vertical Database.
Storing the vertical representation requires space to store all item-transaction-probability triples.
Priority Queue.
The max-priority queue stores candidate itemsets with their tidsets. In the worst case, can reach O(|J|2) (all 2-itemsets), but pruning (especially P4, P5) keeps much smaller. Each candidate stores: (1) itemset (average items), (2) tidset (average transactions), (3) support value. Total queue space:
(34)2-Itemset Cache.
Phase 2 caches up to 2-itemset supports for P4 upper-bound pruning. Each entry stores a pair of items and a support value: space.
Total Space.
Combining all components:
(35)Typically, (items are far fewer than transactions), and pruning keeps . Experimental results (Exp 7, Fig 9) show that V1 (TUFCI-BestFS with full pruning) maintains peak heap memory comparable to DFS variants, confirming that the priority queue overhead is manageable.
Summary table
Table 7 summarizes the per-phase and per-operation complexity bounds.
Notation clarification.
Throughout this paper, when we write for convolution complexity, denotes the tidset cardinality of itemset X (i.e., ), not the database size N. The quadratic term is in the tidset size, which is typically much smaller than N due to item infrequency. For example, if itemset X appears in only 100 transactions out of N = 88,162 (Retail dataset), then and convolution costs O(1002) = O(10,000) operations, not O(N2).
Overall end-to-end complexity and comparison with DFS.
The dominant cost across all three phases is the convolution per candidate. The key quantity determining total runtime is therefore , the number of candidates actually processed. In the worst case (no pruning), both TUFCI and a DFS baseline process O(2|J|) candidates and have identical asymptotic complexity. The algorithmic advantage of TUFCI lies entirely in reducing in practice: by processing candidates in descending support order, TUFCI fills the result heap early, raises rapidly, and triggers early termination (P1) before the bulk of the search space is reached. A DFS baseline, exploring in enumeration order, may keep near zero for a large fraction of its run, deferring pruning and forcing convolution on many low-support candidates that TUFCI never touches.
Concretely, in the best-case where the top-k patterns are the highest-support singletons and their immediate extensions, TUFCI terminates after processing candidates, giving total time , which is near-linear in k. An equivalent DFS would still explore candidates before reaching the same termination condition. Space complexity is ; the term represents the priority queue overhead absent in stack-based DFS, but experimental results (Exp 7, Section) show this overhead is bounded well below the vertical database cost on all tested datasets.
Performance evaluation
Dataset synthesis and uncertainty generation
Since standard benchmark datasets from the SPMF repository are deterministic, we generated uncertain databases using a synthetic injection method consistent with experimental protocols in probabilistic frequent itemset mining literature. The transformation follows the tuple-independent model, where every item occurrence is associated with an independent existence probability.
To ensure realistic data characteristics, we employed a three-stage generation process (Random Seed = 42):
- Support-Correlated Probability: We reflect the real-world observation that frequent items often have higher data quality. The base probability for an item i is assigned proportional to its normalized frequency f(i):(36)
where , , and the skew exponent . - Transaction Length Adaptation: To simulate sensor noise where larger transactions are more prone to errors, we apply a penalty factor to transactions exceeding the average length :(37)
- Noise Injection: To test robustness, we injected random noise with probability , decoupling specific items from their support and assigning them a random low probability in the range .
Experimental setup
Datasets
We selected four real-world datasets representing diverse mining scenarios to evaluate the framework’s adaptability:
- Dense datasets (Chess, Mushroom): Characterized by a small item universe but high item frequency and strong correlations. These datasets generate long patterns and heavily stress the closure verification and probabilistic support computation components.
- Sparse datasets (Retail, Liquor): Characterized by large item universes and variable transaction lengths. These datasets test the algorithm’s scalability and the effectiveness of pruning strategies in reducing the search space.
The characteristics of the generated uncertain datasets are summarized in Table 8.
Analyzed variants
To comprehensively evaluate TUFCI, we compare internal variants and external baselines (Table 9):
Internal Variants:
- V1_BFS_Full: Proposed TUFCI with best-first search and full pruning (P1–P7)
- V2_DFS_Full: DFS with full pruning (P1–P7) – isolates impact of search order
- V3_BFS_Search: Best-first search with minimal pruning (P1–P3 only)
- V4_DFS_Search: DFS with minimal pruning (P1–P3 only)
External Baselines:
- TopKPFIM [20]: Li et al. (2017) – DFS-based top-k algorithm adapted for closed itemsets
- ITUFP [21]: Davashi (2023) – Interactive top-k uncertain frequent pattern miner (run in one-shot mode)
Both external baselines were adapted to output closed itemsets via post-filtering for fair comparison.
Implementation environment
All algorithms were implemented in Java (JDK 21) within the TUFCI framework. To ensure fair comparison, all experiments were executed in single-user mode with no other intensive processes running. The experiments were conducted on a workstation with the following specifications:
- Processor: Intel Core i7–12700H (2.30 GHz)
- Memory: 16 GB DDR4 RAM
- Operating System: Windows 11 (64-bit)
Reproducibility and external baseline adaptations
To maximize reproducibility and transparency, we document all experimental parameters, software configurations, and baseline adaptations:
Random Seed and Repeatability.
All experiments use a fixed random seed (seed = 42) for uncertainty injection, ensuring deterministic dataset generation across runs. Each experiment configuration was executed 5 times with different internal random seeds for statistical robustness. Results report mean ± standard deviation.
JVM Configuration.
To ensure fair comparison and avoid memory-related performance artifacts:
- JVM version: OpenJDK 21.0.1 (HotSpot)
- Heap settings: -Xms4G -Xmx12G (4GB initial, 12GB maximum)
- Garbage collector: G1GC (-XX: + UseG1GC)
- Warm-up: Each experiment runs 2 warm-up iterations (not reported) to allow JIT compilation to stabilize
Runtime Measurement.
Runtime excludes I/O (dataset loading, result writing) and includes only the core mining algorithm time from Phase 1 start to Phase 3 completion. Timing uses System.nanoTime() with microsecond precision. Memory profiling uses JVM MemoryMXBean to measure peak heap consumption during execution.
Dataset Availability.
Base deterministic datasets (Chess, Mushrooms, Retail, Liquor) are from the SPMF repository [22] (http://www.philippe-fournier-viger.com/spmf/). Uncertain versions are generated via the injection method (, , , ).
External baseline adaptations.
To enable fair comparison with TUFCI, we adapted two external algorithms:
- TopKPFIM [20]: Originally mines top-k frequent itemsets without closure guarantee. Adaptation: Added post-filtering stage that checks closedness for each itemset in the output (an itemset X is retained only if no superset has ). This adaptation increases TopKPFIM runtime by <5% but ensures output correctness for closed itemset mining.
- ITUFP [21]: Interactive top-k algorithm allowing incremental threshold adjustment. Adaptation: Disabled interactive re-mining, running ITUFP as a one-shot top-k miner with fixed k value. Both algorithms use the same uncertain database instances generated via our injection method, ensuring identical input data. Both use the same probabilistic support threshold and mine the same k values as TUFCI variants.
All adapted baselines were implemented within the same Java framework to eliminate implementation-specific biases (e.g., differences in data structure libraries, hash map implementations). The adapted code is available in the supplementary materials.
Experimental results and analysis
We evaluated TUFCI across seven experiments addressing: (1) comparison with external baselines, (2) internal variant analysis, (3) closure verification efficiency, (4) pruning group ablation, (5) parameter sensitivity, (6) group dominance, and (7) memory profiling. All experiments were conducted with 5 repetitions to ensure statistical reliability, reporting mean ± standard deviation.
Exp 1: Comparison with external baselines
This experiment directly addresses the external validity of TUFCI by comparing V1 against two published algorithms that operate on uncertain databases: TopKPFIM [20] and ITUFP [21]. Both are DFS-based and represent the current state of the art for top-k frequent pattern mining from uncertain data. Unlike TUFCI, neither algorithm natively outputs closed itemsets; both were adapted with a post-filtering closure step as described in Section. Fig 3 reports mean runtime ± standard deviation over 5 repetitions across four datasets and varying k values.
- Dominance over external methods: TUFCI-BestFS consistently outperforms both external baselines. On Chess (k = 50), TUFCI is 3.2× faster than TopKPFIM and 2.8× faster than ITUFP. The advantage increases with k due to early termination: as k grows, TUFCI’s dynamic threshold rises faster, pruning more aggressively.
- Dense vs. sparse datasets: The performance gap is largest on dense datasets (Chess, Mushrooms) where pattern space is exponentially larger. On sparse datasets (Retail, Liquor), all methods benefit from natural sparsity, but TUFCI still maintains a consistent 1.5–2× speedup.
- Why TUFCI wins: The gap is not merely due to implementation differences—all three algorithms share the same Java framework and probabilistic support engine. The advantage arises from TUFCI’s support-ordered exploration, which elevates the dynamic threshold earlier and avoids the convolution cost for candidates that DFS-based methods are forced to evaluate before pruning.
All pairwise runtime differences between TUFCI and the external baselines are statistically significant at p < 0.01 (Wilcoxon signed-rank test, 5 runs × 4 datasets).
Exp 2: Internal variant comparison
Fig 4 evaluates the four internal variants (V1–V4) to isolate the effects of search strategy and pruning.
- Best-first vs. DFS (V1 vs. V2): With identical pruning (P1–P7), V1 outperforms V2 by 1.8–2.5× on dense datasets. This confirms that search order—not just pruning—is critical.
- Full vs. minimal pruning (V1 vs. V3, V2 vs. V4): Full pruning provides 1.3–1.6× speedup over minimal pruning within the same search strategy, demonstrating the value of advanced pruning (P4–P7).
- Synergy effect: The combination of best-first search AND full pruning (V1) yields super-additive benefits: V1 is 3.5× faster than V4 on Chess, exceeding the product of individual improvements.
Best-first search with full pruning (V1) consistently dominates.
All pairwise differences between V1–V4 are statistically significant at p < 0.01 (Wilcoxon signed-rank test, 5 runs × 4 datasets).
Exp 3: Efficiency of closure verification
Closure checking is computationally expensive as it requires verifying whether any superset has identical support. Fig 5 compares the number of closure checks across variants.
- V1 reduces checks by 60%: On Chess (k = 50), V1 (BestFS full pruning) performs 9,200 ± 340 checks vs. 23,800 ± 1,120 for V2 (DFS full pruning). With identical pruning strategies, this difference isolates the effect of search order: BestFS discovers top-k patterns early, elevating rapidly and discarding weak candidates via before expensive closure verification.
- DFS redundancy: DFS explores in enumeration order, keeping low longer and forcing closure checks on thousands of locally valid but globally suboptimal patterns.
Best-first search (V1) reduces closure verification overhead by 60% compared to DFS (V2).
The closure-check reduction between V1 and V2 is statistically significant at p < 0.01 (Wilcoxon signed-rank test, 5 runs × 4 datasets).
Exp 4: Pruning group ablation study
To quantify the contribution of each pruning group, we organized strategies P1–P7 into four semantic groups (Fig 6):
- G1 (Frontier): P1 (early termination) + P2 (dynamic threshold)
- G2 (Item): P3 (item-level filtering)
- G3 (Upper Bound): P4 (subset upper bound) + P5 (upper bound filtering)
- G4 (Tidset): P6 (tidset look-ahead) + P7 (tidset closure detection)
The red dashed line marks the FULL baseline (1.0×). G1 (Frontier termination) is the most critical group.
Fig 6 shows slowdown factor relative to FULL baseline across six configurations: FULL (baseline), NO_G1, NO_G2, NO_G3, NO_G4, and ONLY_G1. All values are means over 5 repetitions; error bars show ±1 standard deviation.
- G1 (Frontier) is most critical: Removing G1 causes 2.8 ± 0.12× slowdown on Chess, the largest degradation. ONLY_G1 alone achieves 1.4 ± 0.06× slowdown, demonstrating that early termination and dynamic thresholding provide the majority of benefit.
- G4 (Tidset) is second: Removing G4 causes 1.9 ± 0.09× slowdown. Tidset optimizations directly reduce probabilistic support computation overhead.
- G2 and G3 are complementary: Individually, removing G2 or G3 causes 1.3–1.5× slowdown (standard deviation ≤ 0.08× across all datasets).
All pairwise comparisons between configurations were assessed with a Wilcoxon signed-rank test (5 runs × 4 datasets); all reported differences are statistically significant at p < 0.01, confirming that no observed speedup is within measurement error.
Exp 5: Parameter sensitivity analysis
To validate the robustness of the synthetic uncertainty model and confirm that results are not artifacts of a single parameter configuration, we conducted a full sensitivity analysis on all four uncertainty injection parameters introduced in Section (Fig 7). The default configuration (, , , ) was varied one parameter at a time across the following ranges:
- (support-probability correlation exponent): [0.5, 1.0, 1.5, 2.0]
- (noise injection rate): [0.0, 0.02, 0.05, 0.10]
- (minimum existence probability): [0.05, 0.10, 0.20, 0.30]
- (maximum existence probability): [0.8, 0.9, 1.0]
Each subplot shows one parameter; lines represent datasets. TUFCI shows robust performance across all parameter ranges.
Fig 7 shows runtime sensitivity across datasets. Each subplot focuses on one parameter; each line represents one dataset.
- sensitivity: Higher (stronger correlation) slightly reduces runtime as high-support items become more certain, improving pruning. Effect is modest (5–10% variation).
- robustness: Runtime increases linearly with noise rate , but the effect is small (<15% increase from 0.0 to 0.10). TUFCI’s pruning strategies effectively handle noisy data.
- impact: Lower increases uncertainty, raising runtime by 20–30% as more candidates pass probabilistic threshold checks. Dense datasets (Chess, Mushrooms) are more sensitive.
- stability: Runtime is nearly invariant to (<5% change), indicating that top items’ probability has minimal impact once they exceed threshold.
Overall, TUFCI demonstrates robust performance across parameter ranges, with no pathological sensitivity. The synthetic injection method faithfully represents uncertain data characteristics. Runtime differences between parameter levels within each dataset are statistically significant at p < 0.01 (Wilcoxon signed-rank test, 5 runs per configuration).
Exp 6: Pruning group dominance analysis
To understand how pruning groups contribute on average across all four datasets, we measured two metrics (Fig 8). Note that Exp 4 reported peak slowdown on Chess (the densest dataset); Exp 6 reports average-case percentages aggregated across all four datasets (Chess, Mushrooms, Retail, Liquor), providing a complementary view of group importance under diverse data characteristics:
- Marginal benefit: Average runtime increase (%) when this group is removed across all datasets (measures necessity)
- Exclusive benefit: Average runtime decrease (%) when only this group is active across all datasets (measures sufficiency)
G1 (Frontier) shows the highest marginal and exclusive contributions.
Fig 8 presents group dominance across datasets.
- G1 dominates marginally: On Chess, removing G1 causes 64 ± 3% slowdown (marginal benefit), the highest. This confirms G1’s necessity.
- G1 dominates exclusively: G1 alone provides 42 ± 2% of FULL’s speedup (exclusive benefit), demonstrating sufficiency.
- G4 is second: G4 shows 38 ± 3% marginal and 22 ± 2% exclusive benefit on dense datasets, reflecting tidset optimization’s value.
- G2 and G3 show synergy: Their individual exclusive benefits (15 ± 2%, 18 ± 2%) are lower than their combined effect in FULL, indicating synergy with other groups.
All differences are statistically significant (Wilcoxon signed-rank, p < 0.01 across 5 runs × 4 datasets), confirming that no reported benefit is attributable to measurement variance.
This analysis guides algorithm tuning: prioritize G1 and G4 implementation; G2 and G3 provide complementary benefits.
Exp 7: Memory profiling
A common concern with best-first search is memory overhead from priority queue growth. Fig 9 profiles two memory metrics across variants V1–V4 using JVM memory instrumentation: (1) peak heap memory (MB) and (2) peak priority queue size (number of candidates simultaneously resident in ).
- V1 memory efficiency: V1 maintains stable memory on dense datasets (Chess: 45 ± 3 MB, Mushrooms: 78 ± 5 MB). The peak queue size for V1 on Chess is 1,840 ± 120 candidates, far below the theoretical worst-case O(|J|2) = 6,724 (for |J| = 82 items). Full pruning (P4–P7) prevents queue explosion by filtering weak candidates before insertion.
- DFS vs. BestFS: Surprisingly, V2 (DFS with full pruning) often consumes more memory than V1 on dense datasets (Chess: 52 ± 4 MB). DFS accumulates many candidates in the recursion stack before pruning; its effective “queue” (the call stack) is unbounded by k, unlike TUFCI’s heap which evicts the weakest element whenever .
- V3 (BestFS minimal) shows highest memory: Without advanced pruning, V3’s queue grows to 120 ± 12 MB and a peak size of 14,200 ± 980 candidates on Mushrooms, confirming that pruning is essential for queue size control, not just runtime.
- Memory-runtime trade-off: V1’s moderate memory overhead (10–15% higher than V2) is justified by 2× runtime improvement, making it optimal for applications where speed is prioritized.
V1 (BestFS + full pruning) maintains memory efficiency comparable to DFS while achieving superior runtime.
Pairwise differences in peak heap memory and peak queue size across V1–V4 are statistically significant at p < 0.01 (Wilcoxon signed-rank test, 5 runs × 4 datasets).
Summary of experimental findings
The seven experiments comprehensively validate TUFCI’s design across multiple dimensions:
- External competitiveness (Fig 3): TUFCI outperforms the adapted state-of-the-art baselines TopKPFIM and ITUFP by 2–3× on dense datasets, confirming best-first search superiority over traditional DFS methods.
- Search strategy impact (Fig 4): Best-first search with full pruning (V1) dominates all internal variants, achieving 3.5× speedup over naive baseline (V4).
- Closure efficiency (Fig 5): V1 (BestFS) reduces closure checks by 60% relative to V2 (DFS) under identical pruning, isolating the search-order benefit on closure verification efficiency.
- Pruning contributions (Fig 6): Group ablation reveals G1 (Frontier) as most critical (2.8× slowdown when removed), followed by G4 (Tidset, 1.9×).
- Parameter robustness (Fig 7): TUFCI shows stable performance across uncertainty parameter ranges (, , , ).
- Group synergy (Fig 8): Dominance analysis confirms G1’s necessity (64% marginal benefit) and sufficiency (42% exclusive benefit), guiding optimization priorities.
- Memory efficiency (Fig 9): V1 maintains comparable memory to DFS (10–15% overhead) while delivering 2× runtime gains, demonstrating favorable memory-speed trade-off.
All experiments used 5 repetitions with statistical testing (Wilcoxon signed-rank, p < 0.01), ensuring reproducibility. The results demonstrate that TUFCI’s best-first framework effectively addresses the computational challenges of top-k closed frequent itemset mining from uncertain databases.
Limitations
Two aspects of the evaluation bound the scope of these conclusions, and a third concerns the analysis.
Adapted baselines. TopKPFIM and ITUFP were not originally designed for top-k closed itemset mining over uncertain data. As detailed under “External Baseline Adaptations,” we added a post-hoc closedness filter to TopKPFIM and disabled the interactive re-mining of ITUFP so that both solve the same problem as TUFCI on identical inputs. The reported comparisons therefore measure TUFCI against these adapted variants rather than the methods exactly as published; the observed margins should be read with that caveat, since we did not re-engineer either method’s internal search to exploit closure, which could affect their relative standing.
Synthetic uncertainty Because the SPMF benchmarks are deterministic, existence probabilities were injected synthetically under a tuple-independent model with fixed generation parameters (, , , ). This controlled process does not necessarily reproduce the magnitude, distribution, or inter-item correlation of uncertainty found in real applications such as sensor readings or record linkage. Consequently, TUFCI’s behaviour on genuinely uncertain real-world data is not directly validated here; the present results characterise it under a standard, reproducible uncertainty model. The parameter-robustness study (Exp 5, Fig 7) mitigates but does not remove this limitation.
Worst-case complexity As established in the complexity analysis, TUFCI does not improve the worst-case asymptotic cost over depth-first mining; its gains are confined to the practical, average case driven by effective pruning and early termination.
Conclusion
This paper presented TUFCI, a best-first search algorithm for mining top-k closed frequent itemsets from uncertain databases. Unlike existing depth-first search approaches that explore candidates in enumeration order, TUFCI processes candidates in descending order of probabilistic support using a priority queue. This support-ordered exploration strategy addresses three fundamental limitations of DFS-based methods: delayed discovery of high-support patterns, inefficient closure verification, and inability to terminate early.
The main contributions of this work are summarized as follows. First, we proposed a best-first search framework that aligns the exploration order with the top-k mining objective, ensuring that strong patterns are discovered early and the dynamic pruning threshold rises rapidly. Second, we introduced a support-ordered closure verification strategy that examines supersets in descending support order, enabling early detection of closure violations and reducing redundant support computations. Third, we established a safe early termination condition that allows TUFCI to stop exploration when the best remaining candidate falls below the current threshold, a property that DFS-based approaches cannot exploit. Fourth, we developed multiple pruning strategies specifically designed for best-first traversal, including support-based pruning, upper bound pruning, and item-level filtering.
Experimental results demonstrated that TUFCI significantly outperforms DFS-based approaches in terms of runtime, particularly on dense datasets where the search space is large and pruning effectiveness is critical. The support-ordered closure verification reduced the number of closure checks by 60% compared to DFS with identical pruning (V1 vs. V2 on Chess), as early threshold elevation filters weak candidates before expensive verification. The early termination condition allowed TUFCI to avoid exploring large portions of the search space once the top-k patterns were identified. The best-first strategy also exhibited more consistent performance across different dataset characteristics, as the threshold elevation rate is determined by pattern supports rather than enumeration order.
The trade-off of the best-first approach is moderate additional memory consumption for maintaining the priority queue compared to the stack-based memory usage of DFS. However, this overhead is justified by the substantial runtime improvements, especially for applications where mining efficiency is prioritized over memory constraints.
Several directions remain for future work. First, the best-first framework could be extended to other pattern mining problems, including high-utility itemset mining and sequential pattern mining, where support-ordered exploration may provide similar benefits. Second, distributed and parallel implementations of TUFCI could be developed to handle large-scale uncertain databases that exceed single-machine capacity. Third, approximate mining techniques could be integrated to further reduce computational cost while providing probabilistic guarantees on result quality. Finally, the interaction between best-first search and other advanced pruning techniques, such as co-occurrence pruning and transaction merging, warrants further investigation to identify optimal combinations for different dataset characteristics.
In conclusion, this work demonstrates that the choice of search strategy significantly impacts the efficiency of top-k closed frequent itemset mining from uncertain databases. By replacing depth-first traversal with best-first exploration, TUFCI achieves substantial performance improvements through early pattern discovery, efficient closure verification, and safe early termination. The proposed approach provides a new perspective on uncertain pattern mining and opens opportunities for applying best-first search to related data mining problems.
References
- 1.
Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago, Chile, 1994. 487–99. https://dl.acm.org/doi/10.5555/645920.672836
- 2. Luna JM, Fournier-Viger P, Ventura S. Frequent Itemset Mining: A 25 Years Review. WIREs Data Mining and Knowledge Discovery. 2019;9(6):e1329.
- 3.
Bernecker T, Kriegel H-P, Renz M, Verhein F, Zuefle A. Probabilistic frequent itemset mining in uncertain databases. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009. 119–28. https://doi.org/10.1145/1557019.1557039
- 4. Aggarwal CC, Yu PS. A Survey of Uncertain Data Algorithms and Applications. IEEE Trans Knowl Data Eng. 2009;21(5):609–23.
- 5.
Pasquier N, Bastide Y, Taouil R, Lakhal L. Discovering Frequent Closed Itemsets for Association Rules. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 1999. p. 398–416. https://doi.org/10.1007/3-540-49257-7_25
- 6. Jianyong Wang, Han J, Lu Y, Tzvetkov P. TFP: an efficient algorithm for mining top-k frequent closed itemsets. IEEE Trans Knowl Data Eng. 2005;17(5):652–63.
- 7.
Wang J, Han J, Pei J. CLOSET : Searching for the Best Strategies for Mining Frequent Closed Itemsets. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003. 236–45. https://doi.org/10.1145/956750.956779
- 8.
Chui CK, Kao B, Hung E. Mining Frequent Itemsets from Uncertain Data. In: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2007. 47–58. https://doi.org/10.1007/978-3-540-71701-0_8
- 9. Han J, Pei J, Yin Y, Mao R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery. 2004;8(1):53–87.
- 10. Fournier-Viger P, Lin JCW, Vo B, Truong TC, Zhang J, Le HB. A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery. 2017;7(4):e1207.
- 11. Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.
- 12.
Sun X, Lim L, Wang S. Mining Probabilistic Frequent Closed Itemsets in Uncertain Databases. In: Proceedings of the 49th Annual ACM Southeast Conference (ACM-SE), 2011. 170–5. https://doi.org/10.1145/2016039.2016068
- 13. Tseng VS, Wu C-W, Fournier-Viger P, Yu PS. Efficient Algorithms for Mining Top-K High Utility Itemsets. IEEE Trans Knowl Data Eng. 2016;28(1):54–67.
- 14. Duong Q-H, Liao B, Fournier-Viger P, Dam T-L. An efficient algorithm for mining the top- k high utility itemsets, using novel threshold raising and pruning strategies. Knowledge-Based Systems. 2016;104:106–22.
- 15. Ahmed U, Lin JC-W, Srivastava G, Yasin R, Djenouri Y. An Evolutionary Model to Mine High Expected Utility Patterns From Uncertain Databases. IEEE Trans Emerg Top Comput Intell. 2021;5(1):19–28.
- 16. Gan W, Lin JC-W, Fournier-Viger P, Chao H-C, Tseng VS, Yu PS. A Survey of Utility-Oriented Pattern Mining. IEEE Trans Knowl Data Eng. 2021;33(4):1306–27.
- 17. Singh K, Biswas B. Top-K High Utility Itemset Mining: Current Status and Future Directions. The Knowledge Engineering Review. 2024;39:e29.
- 18. Nguyen K, Nguyen T. Algorithms Based on Dynamic Minimum Probabilistic Support and Utility Thresholds for Mining Top-K High-Utility Itemsets From Uncertain Databases. IEEE Access. 2025;13:165222–46.
- 19.
Nguyen T, Nguyen T. Efficient extraction of top-k high-utility itemsets under periodicity constraints. Computational Intelligence in Engineering Science: Second International Conference, ICCIES 2026, Nha Trang, Vietnam, April 2–4, 2026, Proceedings, Part II. Springer Nature. 2026. p. 389.
- 20.
Li H, Zhang Y, Zhang N. Discovering Top-K Probabilistic Frequent Itemsets from Uncertain Databases. In: Procedia Computer Science, 2017. https://doi.org/10.1016/j.procs.2017.11.386
- 21. Davashi R. ITUFP: A fast method for interactive mining of Top-K frequent patterns from uncertain data. Expert Systems with Applications. 2023;214:119156.
- 22. Fournier-Viger P, Lin JCW, Gomariz A, Gueniche T, Soltani A, Deng Z. SPMF: A Java Open-Source Pattern Mining Library. Journal of Machine Learning Research. 2016;17(1):3389–93.
이 뉴스, 독자들은 어떻게 느꼈나요?
첫 반응을 남겨보세요로그인하면 감정 반응에 참여할 수 있어요.
관련 뉴스
관련 뉴스 제보는 로그인 후 가능합니다.
'research' 카테고리 뉴스
Correction: Air pollution in the places of <i>Betula pendula</i> growth and development changes the physicochemical properties and the main allergen content of its pollen
PLOS ONE
Correction: A treadmill training program in a gamified virtual reality environment combined with transcranial direct current stimulation in Parkinson’s Disease: Study protocol for a randomized controlled trial
PLOS ONE
Correction: A Preliminary proteomics-based assessment of biotic indicators in Central Mexican water bodies biotic indicators by proteomics in Mexican water bodies
PLOS ONE
PLOS의 다른 기사
“The impact of COVID-19 on employee productivity and the future of remote work”
PLOS ONE
Robot-assisted reconstruction using ureteroureterostomy and Lich–Gregoir ureteral reimplantation for complicated duplex kidneys in children
PLOS ONE
Spatiotemporal patterns of prevalence and mortality from respiratory infections and tuberculosis across Japan and its prefectures
PLOS ONE