Download and Install RegiStream for Python

Our Philosophy

RegiStream for Python is designed to integrate seamlessly with pandas, making the labeling process as intuitive as possible.

We use pandas accessors to provide a natural, pandas-native experience. By simply adding .lab to your DataFrame or Series, you gain access to labeled versions of your data. This approach means you can continue using all the pandas functionality you already know, with the added benefit of proper variable and value labels.

Whether you're previewing data, creating tables, plotting, or running regressions, RegiStream enhances your workflow without changing it: you just add .lab to bring your labels into the picture.
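For readers unfamiliar with pandas accessors, the following is a minimal sketch of the general mechanism, not RegiStream's actual implementation. The accessor name demo_lab, the class DemoLabelAccessor, the "variable_labels" key in df.attrs, and the label text are all hypothetical and chosen only for illustration.

```python
import pandas as pd


# Hypothetical accessor name, chosen so it cannot clash with RegiStream's own `.lab`.
@pd.api.extensions.register_dataframe_accessor("demo_lab")
class DemoLabelAccessor:
    """Toy accessor that shows column labels instead of column names."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def head(self, n: int = 5) -> pd.DataFrame:
        # Assumption for this sketch: labels live in df.attrs under "variable_labels".
        labels = self._df.attrs.get("variable_labels", {})
        # Return a relabeled copy; the original DataFrame is left untouched.
        return self._df.head(n).rename(columns=labels)


df = pd.DataFrame({"astsni2007": ["01110", "00000"], "ssyk3": [213, 241]})
df.attrs["variable_labels"] = {"astsni2007": "Industry (illustrative label)"}

labeled = df.demo_lab.head()
print(labeled)
```

The point of the pattern is that the underlying DataFrame keeps its original column names, while the accessor returns a labeled view on request.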

Option 1: Install via pip

To quickly install RegiStream directly from PyPI, use the following command in your terminal:

pip install registream

This command will download and install the latest version of RegiStream, including all necessary dependencies.

Verification and Uninstall

To verify the installation, run the following commands in your Python console:

import registream
registream.__version__
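If you prefer a check that does not fail when the package is missing, a sketch using the standard library's importlib.metadata (Python 3.8 or later) could look like this. The helper name installed_version is hypothetical, and the sketch assumes the distribution is named registream, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumption: the installed distribution is named "registream".
found = installed_version("registream")
print(f"registream version: {found}" if found else "registream is not installed")
```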

To uninstall the RegiStream package, run the following command in your terminal:

pip uninstall registream

A Note About Data Files

By default, RegiStream will store data files in your system's user home directory. This enables seamless sharing of label data across projects and programming languages like Stata, R, and Python. The default storage locations are:

If you are working on a secure system that does not have a standard user home directory (e.g., MONA or other high-security environments), you will need to specify a custom directory for RegiStream to store the required files. You can do this by setting the environment variable REGISTREAM_DIR at the beginning of your script:

import os
os.environ['REGISTREAM_DIR'] = "path/to/custom/directory"

Replace "path/to/custom/directory" with the actual path where you want RegiStream to store the data files.
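A slightly more defensive variant is sketched below. It only sets REGISTREAM_DIR when the variable is not already defined, so a value configured outside the script takes precedence, and it creates the folder if it does not exist. The helper name use_registream_dir is hypothetical. Whether RegiStream reads the variable at import time or later, and whether it creates the folder itself, is not stated here, so the sketch assumes the safest order: call the helper before importing registream.

```python
import os
from pathlib import Path


def use_registream_dir(path: str) -> Path:
    """Point REGISTREAM_DIR at `path` unless it is already set, and ensure the folder exists."""
    # setdefault keeps an existing value and returns whichever value ends up in effect.
    chosen = Path(os.environ.setdefault("REGISTREAM_DIR", path))
    chosen.mkdir(parents=True, exist_ok=True)
    return chosen


# Example (placeholder path), to be run before `import registream`:
# use_registream_dir("path/to/custom/directory")
```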

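Alternatively, you can set the environment variable system-wide (for example, in your shell profile on Linux or macOS, or in the system environment variable settings on Windows) to avoid adding it to each script.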

Option 2: Installation in Secure Environments

If you are working on an offline server or a high-security environment (e.g., MONA), you can manually download the RegiStream wheel file and transfer it to the secure system.

Download from PyPI

To install the package manually, follow these steps:

  1. Download the wheel file (*.whl) from PyPI on your local system.
  2. Transfer the wheel file to the secure system through the inbox.
  3. On the secure system, install the package from the local wheel file:

     pip install /path/to/registream-x.x.x-py3-none-any.whl
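On systems with several Python installations, it can help to run pip through the interpreter you actually intend to use. The sketch below shows one way to do that from Python. The helper names pip_install_command and install_wheel are hypothetical, and the wheel path is the same placeholder used in the steps above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def pip_install_command(wheel: Path) -> List[str]:
    """Build a pip command that installs `wheel` into the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", str(wheel)]


def install_wheel(wheel: Path) -> None:
    """Install a local wheel file, failing early if the file is missing."""
    if not wheel.is_file():
        raise FileNotFoundError(f"wheel file not found: {wheel}")
    # check=True raises CalledProcessError if pip exits with an error.
    subprocess.run(pip_install_command(wheel), check=True)


# Example (placeholder path, replace with the real file name):
# install_wheel(Path("/path/to/registream-x.x.x-py3-none-any.whl"))
```

Note that on an offline system pip cannot fetch dependencies, so any packages RegiStream depends on need to be available there already.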

A Note About Data Files

Once the RegiStream package is installed, you will also need the data files. Because secure environments like MONA limit uploads to 10 MB per file, the data files are split into multiple smaller downloads, each under 10 MB. Download each file and transfer it to the secure system.

Since secure systems may not have a standard user home directory, you will need to create a custom directory for RegiStream data. You can choose any location, but we recommend creating a folder named registream. Within this folder, create subfolders for each data type you download. For example:

# create the following directories for registream data
path-to-your-favorite-folder/registream
path-to-your-favorite-folder/registream/autolabel_keys

Folder names must match exactly as indicated (e.g., scb_variables_swe). Once the CSV files are in the appropriate folders, RegiStream will automatically recognize and use them when you run your code.
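The folder layout can also be created and checked from Python. The sketch below is only an illustration: the helper name prepare_data_dirs is hypothetical, and the subfolder list must be replaced with the exact folder names required for the data you downloaded.

```python
from pathlib import Path
from typing import Dict, List


def prepare_data_dirs(root: Path, subfolders: List[str]) -> Dict[str, List[str]]:
    """Create `root` and its subfolders, then report the CSV files found in each."""
    found: Dict[str, List[str]] = {}
    for name in subfolders:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # An empty list means no CSV files have been copied into this folder yet.
        found[name] = sorted(p.name for p in folder.glob("*.csv"))
    return found


# Example (placeholder path and folder name taken from the listing above):
# prepare_data_dirs(Path("path-to-your-favorite-folder/registream"), ["autolabel_keys"])
```

Running it once before and once after the file transfer gives a quick view of which folders are still empty.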

If you do not place the data files in the default directories, you need to set the environment variable REGISTREAM_DIR to specify the custom path:

import os
os.environ['REGISTREAM_DIR'] = "path-to-your-favorite-folder/registream"

Replace "path-to-your-favorite-folder/registream" with the actual path where you created the registream folder and its subfolders.

As mentioned in Option 1, you can also set this environment variable system-wide rather than in each script. This is especially useful in secure environments where you'll be consistently using the same data location.

Example Usage Instructions

Here are some examples of how to use RegiStream with pandas DataFrames in Python:

1. Applying Labels to a DataFrame:

import pandas as pd
import registream

# Load your data
lisa_df = pd.read_stata('path/to/your/data.dta')

# Apply variable labels
lisa_df.autolabel(domain='scb', lang='eng')

# Apply value labels
lisa_df.autolabel(label_type='values', domain='scb', lang='eng')

2. Previewing DataFrames with Labels:

# Preview the dataset without variable labels
cols_to_show = ['astsni2007', 'ssyk3', 'astsni2002']
lisa_df[cols_to_show].head()

# Preview the dataset with variable labels
lisa_df[cols_to_show].lab.head()

# Show both variable and value labels
lisa_df[cols_to_show].lab.show_values().head()

3. Tabulating Data with Labels:

# Tabulate a variable without value labels
lisa_df.astsni2007.value_counts()

# Tabulate a variable with value labels
lisa_df.lab.astsni2007.value_counts()

4. Creating Labeled Plots:

import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate data by year and employment status
agg_df = lisa_df.groupby(["examar", "syssstatj"], as_index=False).agg({"inkpens": "mean"})

# Create a plot with labeled variables
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(data=agg_df.lab, x='examar', y='inkpens', hue='syssstatj', palette="viridis")

# Style adjustments
plt.title('Pension Income by Year and Employment Status')
plt.xlabel('Year')
plt.ylabel('Average Pension Income')
plt.xticks(rotation=45)

plt.show()

5. Regression Analysis with Labels:

import statsmodels.api as sm

# Extract labeled variables
income = lisa_df.lab['dispinkfam04']  # Numeric variable
industry = lisa_df.lab['astsni2007']  # Categorical variable
outcome = lisa_df.lab['dispink04']    # Target variable

# Convert categorical variable into one-hot encoding
X = pd.get_dummies(industry, drop_first=True, dtype=float)
X[income.name] = income.astype(float)

# Add intercept
X = sm.add_constant(X)

# Run the regression
model = sm.OLS(outcome.astype(float), X).fit()
print(model.summary())

6. Metadata Search:

# Search for variables containing 'industry' in their name or label
lisa_df.meta_search("industry")

This will display all variables related to "industry" in your dataset.

7. Getting and Setting Variable Labels:

# Get a single variable label
lisa_df.get_variable_labels('astsni2007')

# Get multiple variable labels
lisa_df.get_variable_labels(['astsni2007', 'ssyk3'])

# Set a variable label
lisa_df.set_variable_labels('astsni2007', "Industry classification (SNI 2007)")

8. Getting and Setting Value Labels:

# Get value labels for a variable
x = lisa_df.get_value_labels('astsni2007')

# Update value labels
updates = {'00000': 'custom label 1', '01110': 'custom label 2'}
lisa_df.set_value_labels('astsni2007', updates)

9. Looking Up Variables in the Data Domain:

from registream import lookup

# Lookup a variable in the SCB domain
lookup('astsni2007', domain='scb', lang='eng')

Additional Resources

For more detailed documentation and examples, visit our: