DataFrame Parser

The DataFrame parser allows you to parse structured data from CSV, TSV, Excel, and other tabular formats into TypedLogic theories. It uses pandas for data loading and automatically infers predicate names from filenames.

Bases: Parser

A parser for tabular data files using pandas.

This parser reads CSV, TSV, Excel and other tabular formats and converts rows to logical facts (ground terms).

Source code in src/typedlogic/parsers/dataframe_parser.py
class DataFrameParser(Parser):
    """
    A parser for tabular data files using pandas.

    This parser reads CSV, TSV, Excel and other tabular formats
    and converts rows to logical facts (ground terms).
    """

    default_suffix = "csv"

    def __init__(self, **pandas_kwargs):
        """
        Initialize the parser.

        Args:
            **pandas_kwargs: Additional keyword arguments passed to pandas read functions
        """
        super().__init__()
        if pd is None:
            raise ImportError(
                "pandas is required for DataFrameParser. "
                "Install it with: pip install pandas"
            )
        self.pandas_kwargs = pandas_kwargs

    def parse(self, source: Union[Path, str, TextIO], **kwargs) -> Theory:
        """
        Parse tabular data into a Theory containing only ground terms.

        Since this parser only handles facts (not rules), it returns a Theory
        with empty sentences but populated ground_terms.

        Args:
            source: Path to file, file content as string, or file-like object
            **kwargs: Additional arguments passed to pandas

        Returns:
            Theory object with ground terms from the tabular data
        """
        ground_terms = self.parse_ground_terms(source, **kwargs)

        # Create a minimal theory with just the ground terms
        theory = Theory()
        theory.ground_terms = ground_terms
        return theory

    def parse_ground_terms(self, source: Union[Path, str, TextIO], **kwargs) -> List[Term]:
        """
        Parse tabular data and return a list of ground terms (facts).

        Args:
            source: Path to file, file content as string, or file-like object
            **kwargs: Additional arguments passed to pandas

        Returns:
            List of Term objects representing the rows as facts
        """
        # Determine predicate name from filename
        predicate_name = None
        if isinstance(source, Path):
            predicate_name = self._extract_predicate_name(source)
        elif isinstance(source, str) and '\n' not in source and '\t' not in source and ',' not in source:
            # Likely a filename string, not data content
            try:
                path = Path(source)
                if path.exists():
                    predicate_name = self._extract_predicate_name(path)
            except (OSError, ValueError):
                # Not a usable path; fall through to the default predicate name
                pass

        # Read the data using pandas
        df = self._read_dataframe(source, **kwargs)

        # Convert to terms using existing utility
        if predicate_name:
            terms = dataframe_to_terms(df, predicate=predicate_name)
        else:
            # Fallback: use 'fact' as default predicate name
            terms = dataframe_to_terms(df, predicate='fact')

        return terms

    def _extract_predicate_name(self, path: Path) -> str:
        """
        Extract predicate name from file path.

        Takes the stem (filename without extension) and uses it as predicate name.
        Handles compound names like "Link.01" -> "Link"

        Args:
            path: Path object

        Returns:
            Predicate name extracted from filename
        """
        stem = path.stem
        # Handle cases like "Link.01.csv" -> "Link"
        return stem.split('.')[0]

    def _read_dataframe(self, source: Union[Path, str, TextIO], **kwargs) -> "pd.DataFrame":
        """
        Read data into a pandas DataFrame, auto-detecting format.

        Args:
            source: Data source
            **kwargs: Additional pandas arguments

        Returns:
            pandas DataFrame with the loaded data
        """
        # Merge parser-level pandas_kwargs with call-level kwargs
        read_kwargs = {**self.pandas_kwargs, **kwargs}

        if isinstance(source, Path):
            return self._read_from_path(source, **read_kwargs)
        elif isinstance(source, str):
            if '\n' in source or '\t' in source or ',' in source:
                # String contains data content
                return pd.read_csv(StringIO(source), **read_kwargs)
            else:
                # String is likely a file path
                return self._read_from_path(Path(source), **read_kwargs)
        elif hasattr(source, "read"):
            # File-like object (e.g. an open file handle or StringIO)
            return pd.read_csv(source, **read_kwargs)
        else:
            raise ValueError(f"Unsupported source type: {type(source)}")

    def _read_from_path(self, path: Path, **kwargs) -> "pd.DataFrame":
        """
        Read DataFrame from a file path, auto-detecting format based on extension.

        Args:
            path: Path to the data file
            **kwargs: Additional pandas arguments

        Returns:
            pandas DataFrame with the loaded data
        """
        suffix = path.suffix.lower()

        if suffix == '.csv':
            return pd.read_csv(path, **kwargs)
        elif suffix in ['.tsv', '.tab']:
            # Default to a tab separator unless the caller supplies one
            kwargs['sep'] = kwargs.get('sep', '\t')
            return pd.read_csv(path, **kwargs)
        elif suffix in ['.xlsx', '.xls']:
            return pd.read_excel(path, **kwargs)
        elif suffix in ['.json']:
            return pd.read_json(path, **kwargs)
        elif suffix in ['.parquet']:
            return pd.read_parquet(path, **kwargs)
        else:
            # Default to CSV reader for unknown extensions
            return pd.read_csv(path, **kwargs)

    def validate_iter(self, source: Union[Path, str, TextIO, ModuleType], **kwargs):
        """
        Validate the tabular data file.

        Checks that the file can be read by pandas and contains valid tabular data.

        Args:
            source: Data source to validate
            **kwargs: Additional arguments

        Yields:
            ValidationMessage objects for any validation issues
        """
        from typedlogic.parser import ValidationMessage

        # Handle ModuleType - not applicable for DataFrame parser
        if isinstance(source, ModuleType):
            yield ValidationMessage(
                message="DataFrame parser does not support Python modules",
                level="error"
            )
            return

        try:
            df = self._read_dataframe(source, **kwargs)

            # Check if DataFrame is empty
            if df.empty:
                yield ValidationMessage(
                    message="DataFrame is empty - no facts will be generated",
                    level="warning"
                )

            # Check for completely empty columns (all NaN)
            empty_cols = df.columns[df.isnull().all()].tolist()
            if empty_cols:
                yield ValidationMessage(
                    message=f"Columns contain only missing values: {empty_cols}",
                    level="warning"
                )

            # Note: pandas automatically handles duplicate columns by renaming them
            # (e.g., 'name' becomes 'name', 'name.1', 'name.2', etc.)
            # So we don't need to check for duplicates - pandas handles this gracefully

        except Exception as e:
            yield ValidationMessage(
                message=f"Failed to read tabular data: {str(e)}",
                level="error"
            )

__init__(**pandas_kwargs)

Initialize the parser.

Args: **pandas_kwargs: Additional keyword arguments passed to pandas read functions

Source code in src/typedlogic/parsers/dataframe_parser.py
def __init__(self, **pandas_kwargs):
    """
    Initialize the parser.

    Args:
        **pandas_kwargs: Additional keyword arguments passed to pandas read functions
    """
    super().__init__()
    if pd is None:
        raise ImportError(
            "pandas is required for DataFrameParser. "
            "Install it with: pip install pandas"
        )
    self.pandas_kwargs = pandas_kwargs

parse(source, **kwargs)

Parse tabular data into a Theory containing only ground terms.

Since this parser only handles facts (not rules), it returns a Theory with empty sentences but populated ground_terms.

Args:

  • source: Path to file, file content as string, or file-like object
  • **kwargs: Additional arguments passed to pandas

Returns: Theory object with ground terms from the tabular data

Source code in src/typedlogic/parsers/dataframe_parser.py
def parse(self, source: Union[Path, str, TextIO], **kwargs) -> Theory:
    """
    Parse tabular data into a Theory containing only ground terms.

    Since this parser only handles facts (not rules), it returns a Theory
    with empty sentences but populated ground_terms.

    Args:
        source: Path to file, file content as string, or file-like object
        **kwargs: Additional arguments passed to pandas

    Returns:
        Theory object with ground terms from the tabular data
    """
    ground_terms = self.parse_ground_terms(source, **kwargs)

    # Create a minimal theory with just the ground terms
    theory = Theory()
    theory.ground_terms = ground_terms
    return theory

parse_ground_terms(source, **kwargs)

Parse tabular data and return a list of ground terms (facts).

Args:

  • source: Path to file, file content as string, or file-like object
  • **kwargs: Additional arguments passed to pandas

Returns: List of Term objects representing the rows as facts

Source code in src/typedlogic/parsers/dataframe_parser.py
def parse_ground_terms(self, source: Union[Path, str, TextIO], **kwargs) -> List[Term]:
    """
    Parse tabular data and return a list of ground terms (facts).

    Args:
        source: Path to file, file content as string, or file-like object
        **kwargs: Additional arguments passed to pandas

    Returns:
        List of Term objects representing the rows as facts
    """
    # Determine predicate name from filename
    predicate_name = None
    if isinstance(source, Path):
        predicate_name = self._extract_predicate_name(source)
    elif isinstance(source, str) and '\n' not in source and '\t' not in source and ',' not in source:
        # Likely a filename string, not data content
        try:
            path = Path(source)
            if path.exists():
                predicate_name = self._extract_predicate_name(path)
        except (OSError, ValueError):
            # Not a usable path; fall through to the default predicate name
            pass

    # Read the data using pandas
    df = self._read_dataframe(source, **kwargs)

    # Convert to terms using existing utility
    if predicate_name:
        terms = dataframe_to_terms(df, predicate=predicate_name)
    else:
        # Fallback: use 'fact' as default predicate name
        terms = dataframe_to_terms(df, predicate='fact')

    return terms

_extract_predicate_name(path)

Extract predicate name from file path.

Takes the stem (filename without extension) and uses it as predicate name. Handles compound names like "Link.01" -> "Link"

Args: path: Path object

Returns: Predicate name extracted from filename

Source code in src/typedlogic/parsers/dataframe_parser.py
def _extract_predicate_name(self, path: Path) -> str:
    """
    Extract predicate name from file path.

    Takes the stem (filename without extension) and uses it as predicate name.
    Handles compound names like "Link.01" -> "Link"

    Args:
        path: Path object

    Returns:
        Predicate name extracted from filename
    """
    stem = path.stem
    # Handle cases like "Link.01.csv" -> "Link"
    return stem.split('.')[0]

_read_dataframe(source, **kwargs)

Read data into a pandas DataFrame, auto-detecting format.

Args:

  • source: Data source
  • **kwargs: Additional pandas arguments

Returns: pandas DataFrame with the loaded data

Source code in src/typedlogic/parsers/dataframe_parser.py
def _read_dataframe(self, source: Union[Path, str, TextIO], **kwargs) -> "pd.DataFrame":
    """
    Read data into a pandas DataFrame, auto-detecting format.

    Args:
        source: Data source
        **kwargs: Additional pandas arguments

    Returns:
        pandas DataFrame with the loaded data
    """
    # Merge parser-level pandas_kwargs with call-level kwargs
    read_kwargs = {**self.pandas_kwargs, **kwargs}

    if isinstance(source, Path):
        return self._read_from_path(source, **read_kwargs)
    elif isinstance(source, str):
        if '\n' in source or '\t' in source or ',' in source:
            # String contains data content
            return pd.read_csv(StringIO(source), **read_kwargs)
        else:
            # String is likely a file path
            return self._read_from_path(Path(source), **read_kwargs)
    elif hasattr(source, "read"):
        # File-like object (e.g. an open file handle or StringIO)
        return pd.read_csv(source, **read_kwargs)
    else:
        raise ValueError(f"Unsupported source type: {type(source)}")

_read_from_path(path, **kwargs)

Read DataFrame from a file path, auto-detecting format based on extension.

Args:

  • path: Path to the data file
  • **kwargs: Additional pandas arguments

Returns: pandas DataFrame with the loaded data

Source code in src/typedlogic/parsers/dataframe_parser.py
def _read_from_path(self, path: Path, **kwargs) -> "pd.DataFrame":
    """
    Read DataFrame from a file path, auto-detecting format based on extension.

    Args:
        path: Path to the data file
        **kwargs: Additional pandas arguments

    Returns:
        pandas DataFrame with the loaded data
    """
    suffix = path.suffix.lower()

    if suffix == '.csv':
        return pd.read_csv(path, **kwargs)
    elif suffix in ['.tsv', '.tab']:
        # Default to a tab separator unless the caller supplies one
        kwargs['sep'] = kwargs.get('sep', '\t')
        return pd.read_csv(path, **kwargs)
    elif suffix in ['.xlsx', '.xls']:
        return pd.read_excel(path, **kwargs)
    elif suffix in ['.json']:
        return pd.read_json(path, **kwargs)
    elif suffix in ['.parquet']:
        return pd.read_parquet(path, **kwargs)
    else:
        # Default to CSV reader for unknown extensions
        return pd.read_csv(path, **kwargs)

validate_iter(source, **kwargs)

Validate the tabular data file.

Checks that the file can be read by pandas and contains valid tabular data.

Args:

  • source: Data source to validate
  • **kwargs: Additional arguments

Yields: ValidationMessage objects for any validation issues

Source code in src/typedlogic/parsers/dataframe_parser.py
def validate_iter(self, source: Union[Path, str, TextIO, ModuleType], **kwargs):
    """
    Validate the tabular data file.

    Checks that the file can be read by pandas and contains valid tabular data.

    Args:
        source: Data source to validate
        **kwargs: Additional arguments

    Yields:
        ValidationMessage objects for any validation issues
    """
    from typedlogic.parser import ValidationMessage

    # Handle ModuleType - not applicable for DataFrame parser
    if isinstance(source, ModuleType):
        yield ValidationMessage(
            message="DataFrame parser does not support Python modules",
            level="error"
        )
        return

    try:
        df = self._read_dataframe(source, **kwargs)

        # Check if DataFrame is empty
        if df.empty:
            yield ValidationMessage(
                message="DataFrame is empty - no facts will be generated",
                level="warning"
            )

        # Check for completely empty columns (all NaN)
        empty_cols = df.columns[df.isnull().all()].tolist()
        if empty_cols:
            yield ValidationMessage(
                message=f"Columns contain only missing values: {empty_cols}",
                level="warning"
            )

        # Note: pandas automatically handles duplicate columns by renaming them
        # (e.g., 'name' becomes 'name', 'name.1', 'name.2', etc.)
        # So we don't need to check for duplicates - pandas handles this gracefully

    except Exception as e:
        yield ValidationMessage(
            message=f"Failed to read tabular data: {str(e)}",
            level="error"
        )

Usage Examples

Basic CSV Parsing

from typedlogic.parsers.dataframe_parser import DataFrameParser

parser = DataFrameParser()

# Parse a CSV file - the predicate name is inferred from the filename
theory = parser.parse("people.csv")  # yields people(name, age, city) facts

# Parse ground terms directly; keyword arguments are passed to pandas,
# and the predicate name is again inferred from the filename.
# For an explicit predicate name, use the dataframe_to_terms utility
# (the helper the parser calls internally), e.g.
# dataframe_to_terms(df, predicate="person")
terms = parser.parse_ground_terms("data.csv")

Supported Formats

The DataFrame parser detects the format from the file extension:

  • CSV files (.csv)
  • TSV files (.tsv, .tab)
  • Excel files (.xlsx, .xls)
  • JSON files (.json)
  • Parquet files (.parquet)

Unknown extensions fall back to the CSV reader.
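The extension dispatch in `_read_from_path` amounts to a suffix-to-reader lookup. A simplified, standalone sketch of that logic (pandas reader names stand in for the live calls):

```python
from pathlib import Path

# Suffix -> pandas reader; anything unlisted falls back to read_csv
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",   # with sep="\t"
    ".tab": "read_csv",   # with sep="\t"
    ".xlsx": "read_excel",
    ".xls": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
}

def reader_for(path: str) -> str:
    """Return the name of the pandas reader for a given file path."""
    return READERS.get(Path(path).suffix.lower(), "read_csv")

print(reader_for("people.CSV"))       # read_csv (suffix check is case-insensitive)
print(reader_for("results.parquet"))  # read_parquet
print(reader_for("notes.txt"))        # read_csv (fallback)
```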

Predicate Name Inference

The parser automatically infers predicate names from filenames:

  • people.csv → people(...) predicates
  • Link.csv → Link(...) predicates
  • data.tsv → data(...) predicates
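The inference mirrors `_extract_predicate_name`: take the filename stem, then keep only the part before the first remaining dot, so versioned names like `Link.01.csv` still map to `Link`. A standalone sketch:

```python
from pathlib import Path

def predicate_name(path: str) -> str:
    """Stem-based inference: drop the extension, then keep the part
    before the first remaining dot."""
    return Path(path).stem.split(".")[0]

print(predicate_name("people.csv"))   # people
print(predicate_name("Link.01.csv"))  # Link
```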

Integration with CLI

The DataFrame parser is automatically used when processing CSV/TSV files via the CLI:

# Convert CSV to other formats
typedlogic dump people.csv -t yaml

# Combine multiple CSV files
typedlogic dump people.csv companies.csv -t prolog

# Use in catalog files
typedlogic dump my_dataset.catalog.yaml

Configuration

The parser accepts standard pandas parameters for customization:

# Custom separator and headers
terms = parser.parse_ground_terms("data.txt", sep="|", header=0)

# Skip rows and handle missing values
terms = parser.parse_ground_terms("messy_data.csv", skiprows=2, na_values=["N/A"])
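When a plain string is passed, the parser decides between "file path" and "inline data" by checking for newlines, tabs, or commas (see `_read_dataframe`). A standalone sketch of that heuristic:

```python
def looks_like_content(source: str) -> bool:
    """True if a string is probably tabular data rather than a file path."""
    return any(ch in source for ch in ("\n", "\t", ","))

print(looks_like_content("people.csv"))            # False -> treated as a path
print(looks_like_content("name,age\nalice,30\n"))  # True  -> parsed in place
```

One consequence of this heuristic: a file path that happens to contain a comma or tab would be misread as inline data, so prefer passing a `Path` object when a filename is unusual.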