API Reference¶
The package is divided into two main independent subpackages: rdm.db and rdm.wrappers.
Database interaction¶
Databases can be accessed via different so-called data sources. You can add your own data source by subclassing the base rdm.db.datasource.DataSource class.
Base DataSource¶
class rdm.db.datasource.DataSource[source]¶
A data abstraction layer for accessing datasets.
This layer is typically hidden from end-users, who access the database only through DBConnection and DBContext objects.
column_values(table, col)[source]¶
Returns a list of distinct values for the given table and column.
param table: target table
param col: target column
connected(tables, cols, find_connections=False)[source]¶
Returns a list of tuples of connected table pairs.
param tables: a list of table names
param cols: a list of column names
param find_connections: set this to True to detect relationships from column names
return: a tuple (connected, pkeys, fkeys, reverse_fkeys)
fetch(table, cols)[source]¶
Fetches rows for the given table and columns.
param table: target table
param cols: list of columns to select
return: rows from the given table and columns
rtype: list
fetch_types(table, cols)[source]¶
Returns a dictionary of field types for the given table and columns.
param table: target table
param cols: list of columns to select
return: a dictionary of types for each attribute
rtype: dict
foreign_keys()[source]¶
Returns: a list of foreign key relations in the form (table_name, column_name, referenced_table_name, referenced_column_name).
Return type: list
select_where(table, cols, pk_att, pk)[source]¶
SELECT with a WHERE clause.
param table: target table
param cols: list of columns to select
param pk_att: attribute for the WHERE clause
param pk: the value that pk_att should match
return: rows from the given table and cols satisfying the condition pk_att == pk
rtype: list
table_column_names()[source]¶
Returns: a list of table/column names in the form (table, col_name).
Return type: list
table_columns(table_name)[source]¶
Parameters: table_name – table name for which to retrieve column names
Returns: a list of columns for the given table
Return type: list
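To illustrate the shape of this interface, the sketch below implements a minimal in-memory stand-in with the same method names (table_columns, fetch, select_where). It is purely illustrative and does not subclass the real rdm.db.datasource.DataSource; an actual subclass would issue SQL queries against a live connection instead.

```python
# Illustrative stand-in for the DataSource interface (hypothetical class;
# not part of rdm). A real data source would run SQL queries instead.
class InMemoryDataSource:
    def __init__(self, tables):
        # tables: {table_name: {"cols": [...], "rows": [tuple, ...]}}
        self.tables = tables

    def table_columns(self, table_name):
        # A list of columns for the given table.
        return list(self.tables[table_name]["cols"])

    def fetch(self, table, cols):
        # Rows for the given table, restricted to the selected columns.
        all_cols = self.tables[table]["cols"]
        idx = [all_cols.index(c) for c in cols]
        return [tuple(row[i] for i in idx) for row in self.tables[table]["rows"]]

    def select_where(self, table, cols, pk_att, pk):
        # Like fetch, but keeps only rows where pk_att == pk.
        all_cols = self.tables[table]["cols"]
        key = all_cols.index(pk_att)
        idx = [all_cols.index(c) for c in cols]
        return [tuple(row[i] for i in idx)
                for row in self.tables[table]["rows"] if row[key] == pk]

ds = InMemoryDataSource({
    "train": {"cols": ["id", "direction"],
              "rows": [(1, "east"), (2, "west")]},
})
print(ds.table_columns("train"))                       # ['id', 'direction']
print(ds.select_where("train", ["id"], "direction", "east"))  # [(1,)]
```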
MySQLDataSource¶
PgSQLDataSource¶
Database Context¶
A DBContext object represents a view of a particular data source that can be used for learning, e.g., by selecting only particular tables and columns, setting a target attribute, and so on.
class rdm.db.context.DBContext(connection, target_table=None, target_att=None, find_connections=False, in_memory=True)[source]¶

__init__(connection, target_table=None, target_att=None, find_connections=False, in_memory=True)[source]¶
Initializes a new DBContext object from the given DBConnection.
Parameters:
- connection – a DBConnection instance
- target_table – set a target table for learning
- target_att – set a target table attribute for learning
- find_connections – set to True if you want to detect relationships based on attribute and table names, e.g., train_id is the foreign key referring to id in table train
- in_memory – load the database into main memory (currently required for most approaches and pre-processing)
copy()[source]¶
Makes a deep copy of the DBContext object (e.g., for making folds).
returns: a deep copy of self
rtype: DBContext
fetch(table, cols)[source]¶
Fetches rows from the db.
param table: table name to select
param cols: list of columns to select
return: list of rows
rtype: list
fetch_types(table, cols)[source]¶
Returns a dictionary of field types for the given table and columns.
param table: target table
param cols: list of columns to select
return: a dictionary of types for each attribute
rtype: dict
rows(table, cols)[source]¶
Fetches rows from the local cache, or from the db if there is no cache.
param table: table name to select
param cols: list of columns to select
return: list of rows
rtype: list
select_where(table, cols, pk_att, pk)[source]¶
SELECT with a WHERE clause.
param table: target table
param cols: list of columns to select
param pk_att: attribute for the WHERE clause
param pk: the value that pk_att should match
return: rows from the given table and cols satisfying the condition pk_att == pk
rtype: list
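The rows method above follows a simple cache-or-fetch pattern: with in_memory=True the table contents are kept in a local cache, otherwise each call falls through to the database. A minimal sketch of the idea (hypothetical names; not the actual rdm implementation):

```python
# Sketch of the cache-or-fetch pattern behind DBContext.rows
# (hypothetical class; the real implementation differs in detail).
class CachedRows:
    def __init__(self, fetch_fn, in_memory=True):
        self.fetch_fn = fetch_fn          # fallback, e.g. a db fetch
        self.cache = {} if in_memory else None

    def rows(self, table, cols):
        if self.cache is not None:
            key = (table, tuple(cols))
            if key not in self.cache:     # first access: fill the cache
                self.cache[key] = self.fetch_fn(table, cols)
            return self.cache[key]
        return self.fetch_fn(table, cols)  # no cache: always hit the db

calls = []
def fake_fetch(table, cols):
    calls.append(table)
    return [(1,), (2,)]

ctx = CachedRows(fake_fetch)
ctx.rows("train", ["id"])
ctx.rows("train", ["id"])      # served from the cache
print(len(calls))              # 1: the db was only hit once
```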
Database converters¶
Converters are used to change the representation of the input database into the native input format of a particular algorithm.
class rdm.db.converters.ILPConverter(*args, **kwargs)[source]¶
Base class for converting a given database context (selected tables, columns, etc.) into inputs acceptable to a specific ILP system.
param discr_intervals: (optional) discretization intervals in the form:
>>> {'table1': {'att1': [0.4, 1.0], 'att2': [0.1, 2.0, 4.5]}, 'table2': {'att2': [0.02]}}
Given these intervals, e.g., att1 would be discretized into three intervals: att1 =< 0.4, 0.4 < att1 =< 1.0, att1 > 1.0.
param settings: dictionary of setting: value pairs
mode(predicate, args, recall=1, head=False)[source]¶
Emits mode declarations in Aleph-like format.
param predicate: predicate name
param args: predicate arguments with input/output specification, e.g.:
>>> [('+', 'train'), ('-', 'car')]
param recall: recall setting (see the Aleph manual)
param head: set to True for head clauses
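The interval semantics above (att1 =< 0.4, 0.4 < att1 =< 1.0, att1 > 1.0) amount to locating a value among sorted cut points. The sketch below reproduces that mapping; it is illustrative only and is not the converter's actual code:

```python
# Map a numeric value to its interval index, given sorted cut points.
# bisect_left gives "value =< cut point" semantics, matching the
# intervals described above (illustrative; not rdm's implementation).
from bisect import bisect_left

def interval_index(value, cutpoints):
    return bisect_left(cutpoints, value)

cuts = [0.4, 1.0]                      # as in {'att1': [0.4, 1.0]}
print(interval_index(0.4, cuts))       # 0: att1 =< 0.4
print(interval_index(0.7, cuts))       # 1: 0.4 < att1 =< 1.0
print(interval_index(1.5, cuts))       # 2: att1 > 1.0
```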
class rdm.db.converters.RSDConverter(*args, **kwargs)[source]¶
Converts the database context to RSD inputs. Inherits from ILPConverter.
class rdm.db.converters.AlephConverter(*args, **kwargs)[source]¶
Converts the database context to Aleph inputs. Inherits from ILPConverter.

__init__(*args, **kwargs)[source]¶
Parameters:
- discr_intervals – (optional) discretization intervals in the form:
>>> {'table1': {'att1': [0.4, 1.0], 'att2': [0.1, 2.0, 4.5]}, 'table2': {'att2': [0.02]}}
Given these intervals, e.g., att1 would be discretized into three intervals: att1 =< 0.4, 0.4 < att1 =< 1.0, att1 > 1.0.
- settings – dictionary of setting: value pairs
- target_att_val – target attribute value for learning
class rdm.db.converters.OrangeConverter(*args, **kwargs)[source]¶
Converts the selected tables in the given context to Orange example tables.

convert_table(table_name, cls_att=None)[source]¶
Returns the specified table as an Orange example table.
param table_name: table name to convert
param cls_att: class attribute name
rtype: orange.ExampleTable

orng_type(table_name, col)[source]¶
Returns an Orange datatype for a given mysql column.
param table_name: target table name
param col: column for which to determine the Orange datatype
class rdm.db.converters.TreeLikerConverter(*args, **kwargs)[source]¶
Converts a db context to the TreeLiker dataset format.
param discr_intervals: (optional) discretization intervals in the form:
>>> {'table1': {'att1': [0.4, 1.0], 'att2': [0.1, 2.0, 4.5]}, 'table2': {'att2': [0.02]}}
Given these intervals, e.g., att1 would be discretized into three intervals: att1 =< 0.4, 0.4 < att1 =< 1.0, att1 > 1.0.
Algorithm wrappers¶
The rdm.wrappers module provides classes for working with the various algorithm wrappers.
Aleph¶
This is a wrapper for the very popular ILP algorithm Aleph. Aleph is an ILP toolkit with many modes of functionality: learning theories, feature construction, incremental learning, etc. Aleph uses mode declarations to define the syntactic bias. Input relations are Prolog clauses, defined either extensionally or intensionally.
See Getting started for an example of using Aleph in your python code.
class rdm.wrappers.Aleph(verbosity=0)[source]¶
Aleph python wrapper.

__init__(verbosity=0)[source]¶
Creates an Aleph object.
param verbosity: can be DEBUG, INFO or NOTSET (default); this controls the verbosity of the output
induce(mode, pos, neg, b, filestem='default', printOutput=False)[source]¶
Induce a theory or features in 'mode'.
param filestem: the base name of this experiment
param mode: in which mode to induce rules/features
param pos: string of positive examples
param neg: string of negative examples
param b: string of background knowledge
return: the theory as a string, or an arff dataset in induce_features mode
rtype: str

set(name, value)[source]¶
Sets the value of setting 'name' to 'value'.
param name: name of the setting
param value: value of the setting
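A minimal call sequence following the signatures above (a sketch only: pos, neg and b are assumed to be strings of Prolog clauses prepared elsewhere, e.g. by an AlephConverter, and clauselength is a standard Aleph setting):

>>> aleph = Aleph()
>>> aleph.set('clauselength', 4)
>>> theory = aleph.induce('induce', pos, neg, b, filestem='trains')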
RSD¶
RSD is a relational subgroup discovery algorithm (Zelezny et al, 2001) composed of two main steps: the propositionalization step and the (optional) subgroup discovery step. RSD effectively produces an exhaustive list of first-order features that comply with the user-defined mode constraints, similar to those of Progol (Muggleton, 1995) and Aleph.
See Example use case for an example of using RSD in your code.
class rdm.wrappers.RSD(verbosity=0)[source]¶
RSD python wrapper.

__init__(verbosity=0)[source]¶
Creates an RSD object.
param verbosity: can be DEBUG, INFO or NOTSET (default); this controls the verbosity of the output
induce(b, filestem='default', examples=None, pos=None, neg=None, cn2sd=True, printOutput=False)[source]¶
Generate features and find subgroups.
param filestem: the base name of this experiment
param examples: classified examples; can be used instead of the separate pos / neg strings below
param pos: string of positive examples
param neg: string of negative examples
param b: string with background knowledge
param cn2sd: find subgroups after feature construction?
return: a tuple (features, weka, rules), where:
- features is a set of prolog clauses of generated features,
- weka is the propositional form of the input data,
- rules is a set of generated cn2sd subgroup descriptions; this will be an empty string if cn2sd is set to False.
rtype: tuple
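A minimal call sequence following the signature above (a sketch only: b and examples are assumed to be strings prepared elsewhere, e.g. by an RSDConverter):

>>> rsd = RSD()
>>> features, weka, rules = rsd.induce(b, examples=examples, cn2sd=True)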
TreeLiker¶
TreeLiker (by Ondrej Kuzelka et al) is a suite of multiple algorithms (controlled by the algorithm setting): RelF, Poly and HiFi.
RelF constructs a set of tree-like relational features by combining smaller conjunctive blocks. The novelty is that RelF preserves the monotonicity of feature reducibility and redundancy (instead of the typical monotonicity of frequency), which allows the algorithm to scale far better than other state-of-the-art propositionalization algorithms.
HiFi is a propositionalization approach that constructs first-order features with hierarchical structure. Due to this feature property, the algorithm performs the transformation in time polynomial in the maximum feature length. Furthermore, the resulting features are the smallest in their semantic equivalence class.
Example usage:
>>> context = DBContext(...)
>>> conv = TreeLikerConverter(context)
>>> treeliker = TreeLiker(conv.dataset(), conv.default_template()) # Runs RelF by default
>>> arff, _ = treeliker.run()
class rdm.wrappers.TreeLiker(dataset, template, test_dataset=None, settings={})[source]¶
TreeLiker python wrapper.

__init__(dataset, template, test_dataset=None, settings={})[source]¶
Parameters:
- dataset – dataset in TreeLiker format
- template – feature template
- test_dataset – (optional) test dataset to transform with the features from the training set
- settings – dictionary of settings (see the TreeLiker documentation)
Wordification¶
Wordification (Perovsek et al, 2015) is a propositionalization method inspired by text mining that can be viewed as a transformation of a relational database into a corpus of text documents. Wordification constructs simple, easily interpretable features, acting as words in the transformed Bag-Of-Words representation.
Example usage:
>>> context = DBContext(...)
>>> orange = OrangeConverter(context)
>>> wordification = Wordification(orange.target_Orange_table(), orange.other_Orange_tables(), context)
>>> wordification.run(1)
>>> wordification.calculate_weights()
>>> arff = wordification.to_arff()
class rdm.wrappers.Wordification(target_table, other_tables, context, word_att_length=1, idf=None)[source]¶

__init__(target_table, other_tables, context, word_att_length=1, idf=None)[source]¶
Wordification object constructor.
param target_table: Orange ExampleTable representing the primary table
param other_tables: secondary tables, as Orange ExampleTables
att_to_s(att)[source]¶
Constructs a "wordification" word for the given attribute.
param att: Orange attribute
calculate_weights(measure='tfidf')[source]¶
Counts word frequency and calculates tf-idf values for words in every document.
param measure: example weights approach (one of tfidf, binary, tf)
prune(minimum_word_frequency_percentage=1)[source]¶
Filters out words that occur less than minimum_word_frequency times.
param minimum_word_frequency_percentage: minimum frequency of words to keep
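The tfidf measure used by calculate_weights combines term frequency with inverse document frequency. A textbook form of the weight is sketched below (illustrative only; the wrapper's exact normalization may differ):

```python
# Textbook tf-idf weight for one word in one document (illustrative;
# not necessarily the exact formula used by Wordification).
import math

def tfidf(count, doc_len, n_docs, docs_with_word):
    tf = count / doc_len                     # term frequency in this document
    idf = math.log(n_docs / docs_with_word)  # rarity across the corpus
    return tf * idf

# A word occurring twice in a 10-word document, present in 1 of 4 documents:
print(round(tfidf(2, 10, 4, 1), 3))
```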
Proper¶
class rdm.wrappers.Proper(input_dict, is_relaggs)[source]¶
Defined in rdm.wrappers.proper.proper; provides the run, init_args_list and parse_excluded_fields methods.
Tertius¶
class rdm.wrappers.Tertius(input_dict)[source]¶
Defined in rdm.wrappers.tertius.tertius; provides the run and init_args_list methods.
OneBC¶
class rdm.wrappers.OneBC(input_dict, is1BC2)[source]¶
Defined in rdm.wrappers.tertius.onebc; provides the run and init_args_list methods.
Caraf¶
class rdm.wrappers.Caraf(input_dict)[source]¶
Defined in rdm.wrappers.caraf.caraf; provides the run method.
Utilities¶
This section documents helper utilities provided by the python-rdm package that are useful in various scenarios.
Mapping unseen examples into propositional feature space¶
When testing classifiers (or in a real-world scenario) you'll need to map unseen (or new) examples into the feature space used by the classifier. To do this, use the rdm.db.mapper.domain_map function.
See Example use case for usage in a cross-validation setting.
rdm.db.mapper.domain_map(features, feature_format, train_context, test_context, intervals={}, format='arff', positive_class=None)[source]¶
Uses the features returned by a propositionalization method to map unseen test examples into the new feature space.
param features: string of features as returned by rsd, aleph or treeliker
param feature_format: 'rsd', 'aleph' or 'treeliker'
param train_context: DBContext with training examples
param test_context: DBContext with test examples
param intervals: discretization intervals (optional)
param format: output format (only arff is used at the moment)
param positive_class: required for aleph
return: the test examples in propositional form
rtype: str
Example:
>>> test_arff = mapper.domain_map(features, 'rsd', train_context, test_context)
Validation¶
Python-rdm provides a helper function for splitting a dataset into folds for cross-validation.
See Example use case for a cross-validation example using RSD.
rdm.validation.cv_split(context, folds=10, random_seed=None)[source]¶
Returns a list of pairs (train_context, test_context), one for each cross-validation fold. The split is stratified.
param context: DBContext to be split
param folds: number of folds
param random_seed: random seed to be used
return: a list of (train_context, test_context) pairs
rtype: list
Example:
>>> for train_context, test_context in cv_split(context, folds=10, random_seed=0):
...     pass  # Your CV loop
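To illustrate what a stratified split does, the sketch below assigns example indices to folds class by class, so every fold keeps roughly the original class proportions. It is illustrative only (a hypothetical helper on plain label lists; cv_split itself works on DBContext objects):

```python
# Stratified fold assignment sketch (hypothetical helper; not the
# cv_split implementation).
import random
from collections import defaultdict

def stratified_fold_ids(labels, folds=10, random_seed=None):
    rng = random.Random(random_seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    counter = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)                  # randomize within each class
        for i in idxs:
            fold_of[i] = counter % folds   # deal round-robin into folds
            counter += 1
    return fold_of

labels = ["pos"] * 6 + ["neg"] * 3
fold_of = stratified_fold_ids(labels, folds=3, random_seed=0)
# Every fold receives 2 "pos" and 1 "neg" example:
for f in range(3):
    members = [labels[i] for i in range(len(labels)) if fold_of[i] == f]
    print(f, members.count("pos"), members.count("neg"))
```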