๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

์ปดํ“จํ„ฐ๊ณตํ•™๊ณผ/Information Retrieval

Boolean Retrieval

์ •๋ณด๊ฒ€์ƒ‰ : ๋งŽ์€ ๋ฌธ์„œ ์ง‘ํ•ฉ์—์„œ ์›ํ•˜๋Š” ์ •๋ณด๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์—์„œ ํ•„์š”ํ•œ ์ž๋ฃŒ๋ฅผ ์ฐพ๋Š” ๊ฒƒ

1996๋…„) TEXT์™€ ๊ฐ™์ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ณด๋‹ค ์ปธ์Œ. ๊ทผ๋ฐ ์‹œ์žฅ์—์„œ๋Š” ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฑด ๋ˆ์ด ๋˜์ง€ ์•Š์•˜์Œ

2006๋…„) ์—ฌ์ „ํžˆ ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ํฌ์ง€๋งŒ TEXT ๊ฒ€์ƒ‰ ์„œ๋น„์Šค ์‹œ์žฅ์ด ์ปค์ง€๋ฉด์„œ ๋น„๊ตฌ์กฐํ™” ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹œ์žฅ์ด ์„ฑ์žฅํ•จ

์ •๋ณด๊ฒ€์ƒ‰์‹œ์Šคํ…œ์ด ์—†๋‹ค๋ฉด, ์œ ๋‹‰์Šค์˜ grep๋ช…๋ น์–ด, | ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ๋ฌธ์„œ์˜ ์–‘์ด ํฌ๋‹ค๋ฉด ์†๋„๊ฐ€ ๋А๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด ๊ฒ€์ƒ‰ ๋ชฉ์ ์œผ๋กœ๋Š” ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋Š” ๊ฑด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฉด ๋” ํƒ์ƒ‰ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋‹ˆ ์‰ฌ์šด ํŽธ์ด์ง€๋งŒ, ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€ ๊ฑด ํŒŒ์ผ์„ ๋๊นŒ์ง€ ๋‹ค ํƒ์ƒ‰ํ•ด ๋ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋ ต๋‹ค. ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋Š” grep ๋ช…๋ น์œผ๋กœ ์ฐพ์„ ์ˆœ ์—†๋‹ค.

Boolean ๋ชจ๋ธ์€ ๋‹จ์–ด ํฌํ•จ ์—ฌ๋ถ€๋งŒ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค. grep ๋˜ํ•œ, Boolean ๋ชจ๋ธ์— ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค.

Boolean ๋ชจ๋ธ๋„ ์ ์ˆ˜๋ฅผ ๋งค๊ฒจ์„œ Sortํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜์ง€๋งŒ, ๊ธฐ๋ณธ์ ์œผ๋ก  ์ˆœ์œ„ ๋งค๊ธธ ์ˆ˜ ์—†๋‹ค.

$ grep -v <์งˆ์˜์–ด> * : ๋ชจ๋“  ๋ฌธ์„œ ํŒŒ์ผ(*)์—์„œ ์งˆ์˜์–ด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€(-v) ๋ฌธ์„œ๋ฅผ ํƒ์ƒ‰ํ•œ๋‹ค.

Term-document (incidence) matrix

Term๊ณผ Document์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ

๊ฒ€์ƒ‰ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด, Term์ด Document์— ๋‚˜์˜ค๋ฉด 1, ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฉด 0

์งˆ์˜์–ด์™€ ์ผ์น˜ํ•˜๋Š” Term์˜ ํ–‰๋“ค์„ ์ฐพ์•„์„œ ์ฃผ์–ด์ง„ ์งˆ์˜์–ด์— ๋”ฐ๋ผ ํ–‰๋ผ๋ฆฌ AND, OR, NOT ์—ฐ์‚ฐ ์ˆ˜ํ–‰

์ด๋•Œ, NOT์€ 1๊ณผ 0์„ ๋งž๋ฐ”๊ฟˆ.

bitwise : ๊ฐ™์€ ์œ„์น˜์˜ bit๋ผ๋ฆฌ ์—ฐ์‚ฐ

์˜ˆ) ์งˆ์˜์–ด : Brutus AND Caesar but NOT Calpurnia

Brutus = 110100, Caesar = 110111, Calpurnia = 010000

์—ฐ์‚ฐ : 110100 AND 110111 AND 101111 = 100100

→ 1, 4๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ํฌํ•จ๋˜์ง€ ์•Š์Œ.

Term-document (incidence) matrix๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ๋‹ค๋ฉด, ์งˆ์˜์–ด ํ‚ค์›Œ๋“œ์— ํ•ด๋‹นํ•˜๋Š” Term์˜ ํ–‰์„ ์ฐพ์•„์„œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด๋‚ผ ์ˆ˜ ์žˆ์Œ. (Boolean ๊ฒ€์‚ฌ ๊ธฐ๋ณธ ๋ฐฉ๋ฒ•๋ก )

Boolean Model ๊ธฐ๋ณธ ๊ฐ€์ • : ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ์€ ๊ณ ์ •๋˜์–ด ์žˆ์Œ.

๋ง๋ญ‰์น˜ : Corpus, ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ

Refine : ์‚ฌ๋žŒ์ด๋‚˜ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ์ด ์งˆ์˜์–ด๋ฅผ ๋” ์ •๊ตํ•˜๊ฒŒ ์ˆ˜์ •ํ•ด์„œ ์žฌ๊ฒ€์ƒ‰์„ ํ•˜๋Š” ํ–‰์œ„

๊ฒ€์ƒ‰๋„ ์‹œ์Šคํ…œ ํ‰๊ฐ€ ๊ธฐ์ค€

Precision ์ •๋ฐ€๋„ : ๊ฒฐ๊ณผ๊ฐ€ ์งˆ์˜์–ด์™€ ์–ผ๋งˆ๋‚˜ ๋ถ€ํ•ฉํ•˜๋Š”์ง€

๋ชจ๋ธ์ด True๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ ์ค‘์—์„œ ์‹ค์ œ True์ธ ๊ฒƒ์˜ ๋น„์œจ(=์ •๋‹ต๋ฅ )

์‹ค์ œ๋กœ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ / ๋ชจ๋ธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ

์ •๋‹ต / ๋‚ด๊ฐ€

Recall ์žฌํ˜„์œจ : ์งˆ์˜์–ด์— ๋ถ€ํ•ฉํ•˜๋Š” ๋ฌธ์„œ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฐพ์•„๋‚ด๋Š”์ง€

์‹ค์ œ True์ธ ๊ฒƒ ์ค‘์—์„œ ๋ชจ๋ธ์ด True๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์˜ ๋น„์œจ(=์ œ๊ณต๋ฅ )

์ •๋‹ต / ์‹ค์ œ

์ •ํ™•๋„(Accuracy)๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ํ‰๊ฐ€์— ์“ฐ์ด์ง€ ์•Š์Œ

์ •๋ฐ€๋„๊ฐ€ ๋†’์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋‚ฎ๊ณ , ์ •๋ฐ€๋„๊ฐ€ ๋‚ฎ์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋†’์Œ.

์ •๋ฐ€๋„๊ฐ€ 100์ธ ๊ฒฝ์šฐ : ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ 100๊ฐœ์—์„œ 1๊ฐœ ์ฐพ์Œ. → ์žฌํ˜„์œจ์€ 0.01

์žฌํ˜„์œจ์ด 100์ธ ๊ฒฝ์šฐ : ์ฐพ์€ ๋ฌธ์„œ 100๊ฐœ ์ค‘ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ชจ๋“  ๋ฌธ์„œ๊ฐ€ 1๊ฐœ์ž„ → ์ •๋ฐ€๋„๋Š” 0.01

F-measure : ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์— ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ  ๊ตฌํ•œ ํ‰๊ท 

→ ํ”ํžˆ ์•ŒํŒŒ(๊ฐ€์ค‘์น˜)๋ฅผ 0.5๋กœ ์ฃผ์–ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ฆ

F-measure๋กœ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•จ.

์˜ˆ์‹œ)

Q1. ๊ฒ€์ƒ‰ ๋Œ€์ƒ ๋ฌธ์„œ 100๋งŒ ๊ฐœ, ๊ฐ ๋ฌธ์„œ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์žˆ์Œ. ํ•œ ๋‹จ์–ด๋Š” ํ‰๊ท ์ ์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ, ๋ถ€ํ˜ธ ํฌํ•จ 6bytes(6๊ธ€์ž). ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ๋Š”?

A1. ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ = 100๋งŒ * (1000 * 6bytes) = 6,000,000,000 = 60์–ตbytes = 6GB

Q2. 10์–ต(100๋งŒ * 1000 = 1,000,000,000)๊ฐœ ์˜ ๋‹จ์–ด ์ค‘, 500,000๊ฐœ์˜ ๋‹ค๋ฅธ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค. Term-Document Matrix์˜ ํฌ๊ธฐ๋Š”?

A2.

Term ๊ฐœ์ˆ˜ = 50๋งŒ

Document ๊ฐœ์ˆ˜ = 100๋งŒ

Term-Document Matrix ํฌ๊ธฐ = 50๋งŒ * 100๋งŒ = 5์ฒœ ์–ต

→ Boolean ๊ฒ€์ƒ‰์„ ์œ„ํ•ด Term-Document Matrix๋ฅผ ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด์•ผ ํ•˜๋Š”๋ฐ, ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํผ.

5์ฒœ ์–ต๊ฐœ์˜ 0๊ณผ 1์ค‘ 0์˜ ๋น„์ค‘์ด ๋ณดํ†ต 99.8%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1์€ 0.2%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1000๊ฐœ๋กœ ๋งŽ์ง€ ์•Š์Œ.

๋”ฐ๋ผ์„œ, 1์ธ ๊ฒƒ๋งŒ ๋”ฐ๋กœ ํ‘œ์‹œํ•˜๋„๋ก ํ•จ. ⇒ Inverted Index (Inverted file)

0์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ

term(ํ–‰) = 50๋งŒ๊ฐœ, document(์—ด) = 100๋งŒ๊ฐœ

ํ•˜๋‚˜์˜ ๋ฌธ์„œ์˜ term ์ •๋ณด = ํ•œ ์—ด(50๋งŒ๊ฐœ์˜ term), ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋Š” 1000๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋จ ⇒ ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— term ์ค‘ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ๋‚˜์˜ค๋ฏ€๋กœ 50๋งŒ ๊ฐœ์˜ ํ•œ ์—ด์—์„œ 1์€ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์ž„. 49๋งŒ9์ฒœ๊ฐœ๋Š” 0์ž„. ⇒ 50๋งŒ ๊ฐœ์—์„œ 1000๊ฐœ๊ฐ€ 1์ด์—ˆ์œผ๋‹ˆ 100๋งŒ๊ฐœ์—์„œ๋Š” 2000๊ฐœ๊ฐ€ 1์ž„. ⇒ 2์ฒœ/100๋งŒ = 0.002. 0.2%๊ฐ€ 1์ด๊ณ , 99.8%๊ฐ€ 0์ž„.

Inverted index (inverted file)

Term-Document Matrix์˜ ๊ณต๊ฐ„ ์ƒ์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด 1์ธ ๋ฌธ์„œ๋งŒ ๋ชฉ๋กํ™”ํ•จ.

๊ฒ€์ƒ‰ ๋Œ€์ƒ(๋ฌธ์„œ, ์›น, PC๊ฒฝ๋กœ)์— ๋”ฐ๋ผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Posting list์—๋Š” ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฌธ์„œ์˜ ๋ฒˆํ˜ธ ์‹๋ณ„์ž๋ฅผ ๋ถ€์—ฌํ•จ.

Posting : ๋ฌธ์„œ ๋ฒˆํ˜ธ ์‹๋ณ„์ž ํ•˜๋‚˜ํ•˜๋‚˜

Posting list : Posting์ด ๋‚˜์—ด๋œ ํ•˜๋‚˜์˜ ๋ฐฐ์—ด(๋ฆฌ์ŠคํŠธ)

Posting lists : Posting list์˜ ์ง‘ํ•ฉ

์งˆ์˜์–ด๋ฅผ ์ฐพ์„ ๋•Œ ๋” ๋นจ๋ฆฌ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ๋ฌธ์„œ๋ฅผ ์ •๋ ฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Posting list๋Š” ๋ฐฐ์—ด๋ณด๋‹ค ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํฌ์ธํ„ฐ ๊ณต๊ฐ„์„ ๋” ์ฐจ์ง€ํ•œ๋‹ค.

Dictionary == Vocabulary (Term์˜ ์ง‘ํ•ฉ)

Inverted Index ๋งŒ๋“œ๋Š” ๊ณผ์ •

[๋‹จ๊ณ„]

  1. Tokenization (ํ† ํฐํ™”)
    1. → ๋งŒ์•ฝ, ๋ฌธ์„œ๊ฐ€ TXT๊ฐ€ ์•„๋‹Œ ์•„๋ž˜ํ•œ๊ธ€, html์ด๋ผ๋ฉด? → ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ํ† ํฐํ™” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ.
    2. Linguistic Modules๋ฅผ ํ†ตํ•ด ์–ธ์–ด์— ๋”ฐ๋ผ ์ฒ˜๋ฆฌํ•จ. (์˜์–ด : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ, ์›ํ˜•์œผ๋กœ ๋งŒ๋“ฆ)
    3. Indexer : ํ† ํฐ์„ ์ธ๋ฑ์Šค๋กœ ๋งŒ๋“ฆ
      • ์–ด๋А ํ† ํฐ์ด ์–ด๋А ๋ฌธ์„œ์—์„œ ๋‚˜์™”๋Š”์ง€ Postings list(ํ•ด๋‹น ํ† ํฐ์˜ ๋ฌธ์„œ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„) ๊ตฌ์ถ•
      • Inverted Index๋ฅผ ๋งŒ๋“ฆ.
        1. (Term, DocID) ์Œ์œผ๋กœ ๋งŒ๋“ฆ
        2. Term์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•จ.
        3. ํ•„์š” ์‹œ, TF ์ •๋ณด ์ถ”๊ฐ€
          • ํ•œ ๋ฌธ์„œ์˜ ์ค‘๋ณต Term์€ ํ•˜๋‚˜๋กœ ํ†ต์ผ
          • Term Frequency(TF) : ํ•œ ๋ฌธ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ Term์ด ๋‚˜์˜จ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
        4. ํ•„์š” ์‹œ, DF ์ •๋ณด ์ถ”๊ฐ€
          • Document Frequency(DF) : Term์ด ๋‚˜์˜จ ๋ฌธ์„œ์˜ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
          • ์ด๋•Œ Term Frequency๋ผ๋ฆฌ ํ•ฉ์„ ๊ตฌํ•จ
          • DF๋กœ ํ•ฉ์ณ์ง„ ์ค‘๋ณต Term์€ ๊ฐ๊ฐ ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ์‹์œผ๋กœ ์ž์‹ ์˜ (DocID, TF)๋ฅผ ๊ฐ€๋ฆฌํ‚ด → ๋‚˜์ค‘์— ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ํ˜•์‹์˜ Inverted Index์˜ ๊ธฐํ‹€์ด ๋จ.  

์‚ฌ๋žŒ์ด ์“ฐ๋Š” ๋‹จ์–ด๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ Dictionary size๋Š” Postings list์— ๋น„ํ•ด ๋ณ€ํ™”๊ฐ€ ์ ์Œ.

๋”ฐ๋ผ์„œ, Dictionary๋Š” ๊ฒ€์ƒ‰ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•ด Main Memory๋กœ load ๋จ.

Postings๋Š” ํฌ๊ธฐ๊ฐ€ ์ ์  ์ปค์ง€๊ณ  ๋ณ€ํ™”๋„ ์žฆ์•„ Disk์— ์ €์žฅ๋จ.

 

Postings list ์ €์žฅ ๋ฐฉ๋ฒ•

  1. ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ
  2. ๊ฐ€๋ณ€ ๋ฐฐ์—ด
  3. hybrid scheme : ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ + ๊ฐ€๋ณ€ ๋ฐฐ์—ด (๋ฐฐ์—ด์„ ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ๋กœ ์—ฐ๊ฒฐํ•จ.)

์งˆ์˜์–ด

  1. Conjunctive : AND์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉif not sorting ์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ๊ณฑ
  2. ์กฐ๊ฑด : Posting list๊ฐ€ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด์•ผ ํ•จ.
  3. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ค๋ฅด๋ฉด ์ž‘์€ ์ชฝ์˜ Posting์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•˜๊ณ  ๋‹ค์‹œ ๋น„๊ต
  4. Disjunctive : OR์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉ
  5. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ฌ๋ผ๋„ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๋” ์ž‘์€ ์ชฝ์„ ๋จผ์ € ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  6. NOT → ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผNOT์ด ๋ถ™์€ Posting list๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, ์ž์‹ ์—๊ฒŒ ์—†๋Š” Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅธ Posting list์™€ ์—ฐ์‚ฐํ•จ.NOT์ด ๋ถ™์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ NOT์ด ๋ถ™์ง€ ์•Š์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅด๋ฉด, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค. ๊ฐ™์œผ๋ฉด ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‘ Posting ๋ชจ๋‘ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  7. <AND>
  8. NOT์ด ๋ถ™์€ ๋‹จ์–ด๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๋ฉด ์•ˆ ๋จ.

WestLaw ์‹œ์Šคํ…œ

์งˆ์˜์–ด๋ฅผ ๋งŒ๋“ค ๋•Œ, AND, OR, NOT + /์ˆซ์ž, ! ๊ฐ€๋Šฅํ•จ.

๋„์–ด์“ฐ๊ธฐ : OR

/<์ˆซ์ž> : <์ˆซ์ž> ๋‹จ์–ด ์ด๋‚ด์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ๋•Œ, AND

! : Wild-card ๋ฌธ์ž *

/s : ํ•œ ๋ฌธ์žฅ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

/p : ํ•œ ๋ฌธ๋‹จ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

Proximity operators : ์ธ์ ‘์„ฑ์„ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ์—ฐ์‚ฐ ๊ธฐํ˜ธ

์—ฐ์‚ฐ ์ˆœ์„œ ์ตœ์ ํ™”

  1. ์•ž๋ถ€ํ„ฐ ์ฒ˜๋ฆฌ
  2. ์ˆœ์„œ ๋ฐ”๊ฟ”์„œ ์ฒ˜๋ฆฌ → ์‹œ๊ฐ„ ์ ˆ์•ฝ ๊ฐ€๋ŠฅOR ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ
    1. Doc Freq๊ฐ€ ์ž‘์€ ๊ฒƒ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•จ.
    2. Posting list size == Doc Freq
    3. ์—ฐ์‚ฐ ์‹œ ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ์ตœ๋Œ€๋กœ ๊ณ„์‚ฐํ•ด์„œ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ์ž‘์€ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•จ.
  3. AND ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ

์ธ์ ‘์กฐ๊ฑด

์งˆ์˜์–ด์˜ ๋‹จ์–ด๊ฐ€ ๋„์–ด์“ฐ๊ธฐ ๋‹จ์œ„๋กœ ๋Š์–ด์ง€๋ฉด ์•ˆ ๋˜๊ณ  ๋ถ™์–ด์žˆ์–ด์•ผ ํ•  ๋•Œ

Proximity ์ธ์ ‘์กฐ๊ฑด ๋ช…๋ น์–ด : NEAR

๊ตฌ์กฐํ™”๋œ TEXT๊นŒ์ง€ ์ƒ๊ฐํ•ด์„œ ๊ฒ€์ƒ‰ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ(์ €์ž์™€ ๋‚ด์šฉ ๊ตฌ์กฐ๊ฐ€ ๋งž๋Š” ๊ฒ€์ƒ‰)

Inverted Index + Position Information

Position Information : Posting์ด ํ•ด๋‹น ๋ฌธ์„œ๊ฐ€ ๊ทธ ๋ฌธ์„œ ๋‚ด์—์„œ ์–ด๋””์— ๋‚˜์™”๋Š”์ง€ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด list๋ฅผ ๊ฐ€๋ฆฌํ‚ด

Position Information Size = Term Frequency

๋‘ ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์ฐพ๋Š” ๋ฒ• : AND ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น Term ๋‚˜์™”๋Š”์ง€ ๊ฒ€์‚ฌ → ๋‘ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋น„๊ตํ•จ → ์ธ์ ‘ ์กฐ๊ฑด์— ๋งž์„ ๊ฒฝ์šฐ ๊ฒฐ๊ณผ์— ํฌํ•จ

์ธ์ ‘ ์ฐจ์ด : 1์ด๋ฉด ๋‘ ๋‹จ์–ด๊ฐ€ ๋ถ™์–ด์žˆ์Œ์„ ์˜๋ฏธํ•˜๊ณ , 2๋ฉด ๋‘ ๋‹จ์–ด ์‚ฌ์ด์— ํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ์˜๋ฏธํ•จ.

Term Freq ํ™œ์šฉ

Term Freq๊ฐ€ ํฐ ์ˆœ์œผ๋กœ ๊ฒฐ๊ณผ ์ œ๊ณตํ•˜๊ธฐ

์–ด๋–ค Term์ด ํ•˜๋‚˜์˜ ๋ฌธ์„œ์—์„œ Document์—์„œ ๋ช‡ ๋ฒˆ ๋‚˜์™”๋Š”์ง€ ๊ณ ๋ ค

์–ด๋–ค Posting์˜ Term Freq == ๊ทธ Posting์˜ Positional Information ๋ฆฌ์ŠคํŠธ์˜ ๊ธธ์ด

๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ, ๊ฐ ๋ฌธ์„œ์˜ Term Freq์˜ ํ•ฉ์ด ํฐ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•จ.

Ranking Search

Boolean ๋ชจ๋ธ์€ ํ•ด๋‹น ์งˆ์˜์–ด๊ฐ€ ๋ฌธ์„œ์— ํฌํ•จ ๋๋Š”์ง€ ์•ˆ ๋๋Š”์ง€๋งŒ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์— Ranking์„ ๋งค๊ธฐ๋Š” ๊ฑด ์›์น™์ ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ•˜์ง€๋งŒ, ๊ธฐ์ค€์„ ์ ์šฉํ•ด์„œ ์กฐ๊ธˆ ๋” ์œ ์šฉํ•  ๋งŒํ•œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Proximity ์ธ์ ‘์„ฑ์„ ์ด์šฉํ•ด์„œ ์ธ์ ‘ํ•  ์ˆ˜๋ก ์‚ฌ์šฉ์ž๊ฐ€ ๋” ์›ํ•˜๋Š” ์ •๋ณด์— ๋” ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€๊นŒ์šด ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ์˜ Term Freq๋ฅผ ์ด์šฉํ•ด์„œ ํฐ ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ๊ฐ€ ๋งŒ๋“ค์–ด์ง„ ์‹œ๊ฐ„์„ ์ธ๋ฑ์Šค์— ๋ถ€์—ฌํ•ด์„œ ์ตœ๊ทผ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜๋‹ค.

์ •๋ณด ๊ฒ€์ƒ‰ VS DB ๊ฒ€์ƒ‰

์ •๋ณด ๊ฒ€์ƒ‰

๋น„๊ตฌ์กฐํ™” ๋จ

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๋ถˆ๊ฐ€๋Šฅ

๋‹จ, TEXT์ด๋”๋ผ๊ณ  ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ. Semi-Structured Data ์˜ˆ) PPT, XML

๊ธ€์ž์˜ ํฌ๊ธฐ, ๊ธ€์ž์˜ ๊ตต๊ธฐ์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ.

DB ๊ฒ€์ƒ‰

๊ตฌ์กฐํ™”๋จ.

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ

๊ด€๋ จ ์žˆ๋Š” Concept ๊ฒ€์ƒ‰ : ์•„์ง ๋ถˆ๊ฐ€๋Šฅ

์šฉ์–ด

Clustering : ๊ตฐ์ง‘(๋ชจ์œผ๊ธฐ)

Classfication : ๋ถ„๋ฅ˜

์›น ๊ฒ€์ƒ‰

๋‹ค์–‘ํ•œ ๋ฌธ์„œ, ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ, ์งˆ์˜์–ด, ์ •๋ณด ์กด์žฌํ•จ.

๋งํฌ๋œ ์ •๋ณด ํ™œ์šฉํ•˜๊ฑฐ๋‚˜ ํด๋ฆญํ•œ ์ •๋ณด ํ™œ์šฉ ๊ฐ€๋Šฅํ•จ.

Cross-language information retrieval : ๊ต์ฐจ ์–ธ์–ด, ๋ฒˆ์—ญํ•ด์„œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ์ œ๊ณต

Question answering : ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ ์งˆ๋ฌธ&๋‹ต๋ณ€

Summarization : ๊ฒ€์ƒ‰๊ฒฐ๊ณผ ์š”์•ฝํ•ด์„œ ์ œ๊ณต

TEXT mining : TEXT์—์„œ ํ•„์š”ํ•œ ์ •๋ณด ๋ฝ‘์•„์„œ ์•Œ๋ ค์คŒ

1996๋…„) TEXT์™€ ๊ฐ™์ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ณด๋‹ค ์ปธ์Œ. ๊ทผ๋ฐ ์‹œ์žฅ์—์„œ๋Š” ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฑด ๋ˆ์ด ๋˜์ง€ ์•Š์•˜์Œ

2006๋…„) ์—ฌ์ „ํžˆ ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ํฌ์ง€๋งŒ TEXT ๊ฒ€์ƒ‰ ์„œ๋น„์Šค ์‹œ์žฅ์ด ์ปค์ง€๋ฉด์„œ ๋น„๊ตฌ์กฐํ™” ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹œ์žฅ์ด ์„ฑ์žฅํ•จ

์ •๋ณด๊ฒ€์ƒ‰์‹œ์Šคํ…œ์ด ์—†๋‹ค๋ฉด, ์œ ๋‹‰์Šค์˜ grep๋ช…๋ น์–ด, | ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ๋ฌธ์„œ์˜ ์–‘์ด ํฌ๋‹ค๋ฉด ์†๋„๊ฐ€ ๋А๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด ๊ฒ€์ƒ‰ ๋ชฉ์ ์œผ๋กœ๋Š” ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋Š” ๊ฑด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฉด ๋” ํƒ์ƒ‰ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋‹ˆ ์‰ฌ์šด ํŽธ์ด์ง€๋งŒ, ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€ ๊ฑด ํŒŒ์ผ์„ ๋๊นŒ์ง€ ๋‹ค ํƒ์ƒ‰ํ•ด ๋ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋ ต๋‹ค. ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋Š” grep ๋ช…๋ น์œผ๋กœ ์ฐพ์„ ์ˆœ ์—†๋‹ค.

Boolean ๋ชจ๋ธ์€ ๋‹จ์–ด ํฌํ•จ ์—ฌ๋ถ€๋งŒ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค. grep ๋˜ํ•œ, Boolean ๋ชจ๋ธ์— ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค.

Boolean ๋ชจ๋ธ๋„ ์ ์ˆ˜๋ฅผ ๋งค๊ฒจ์„œ Sortํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜์ง€๋งŒ, ๊ธฐ๋ณธ์ ์œผ๋ก  ์ˆœ์œ„ ๋งค๊ธธ ์ˆ˜ ์—†๋‹ค.

$ grep -v <์งˆ์˜์–ด> * : ๋ชจ๋“  ๋ฌธ์„œ ํŒŒ์ผ(*)์—์„œ ์งˆ์˜์–ด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€(-v) ๋ฌธ์„œ๋ฅผ ํƒ์ƒ‰ํ•œ๋‹ค.

Term-document (incidence) matrix Term๊ณผ Document์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ

๊ฒ€์ƒ‰ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด, Term์ด Document์— ๋‚˜์˜ค๋ฉด 1, ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฉด 0

์งˆ์˜์–ด์™€ ์ผ์น˜ํ•˜๋Š” Term์˜ ํ–‰๋“ค์„ ์ฐพ์•„์„œ ์ฃผ์–ด์ง„ ์งˆ์˜์–ด์— ๋”ฐ๋ผ ํ–‰๋ผ๋ฆฌ AND, OR, NOT ์—ฐ์‚ฐ ์ˆ˜ํ–‰

์ด๋•Œ, NOT์€ 1๊ณผ 0์„ ๋งž๋ฐ”๊ฟˆ.

bitwise : ๊ฐ™์€ ์œ„์น˜์˜ bit๋ผ๋ฆฌ ์—ฐ์‚ฐ

์˜ˆ) ์งˆ์˜์–ด : Brutus AND Caesar but NOT Calpurnia

Brutus = 110100, Caesar = 110111, Calpurnia = 010000

์—ฐ์‚ฐ : 110100 AND 110111 AND 101111 = 100100

→ 1, 4๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ํฌํ•จ๋˜์ง€ ์•Š์Œ.

Term-document (incidence) matrix๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ๋‹ค๋ฉด, ์งˆ์˜์–ด ํ‚ค์›Œ๋“œ์— ํ•ด๋‹นํ•˜๋Š” Term์˜ ํ–‰์„ ์ฐพ์•„์„œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด๋‚ผ ์ˆ˜ ์žˆ์Œ. (Boolean ๊ฒ€์‚ฌ ๊ธฐ๋ณธ ๋ฐฉ๋ฒ•๋ก )

Boolean Model ๊ธฐ๋ณธ ๊ฐ€์ • : ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ์€ ๊ณ ์ •๋˜์–ด ์žˆ์Œ.

๋ง๋ญ‰์น˜ : Corpus, ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ

Refine : ์‚ฌ๋žŒ์ด๋‚˜ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ์ด ์งˆ์˜์–ด๋ฅผ ๋” ์ •๊ตํ•˜๊ฒŒ ์ˆ˜์ •ํ•ด์„œ ์žฌ๊ฒ€์ƒ‰์„ ํ•˜๋Š” ํ–‰์œ„

๊ฒ€์ƒ‰๋„ ์‹œ์Šคํ…œ ํ‰๊ฐ€ ๊ธฐ์ค€

Precision ์ •๋ฐ€๋„ : ๊ฒฐ๊ณผ๊ฐ€ ์งˆ์˜์–ด์™€ ์–ผ๋งˆ๋‚˜ ๋ถ€ํ•ฉํ•˜๋Š”์ง€

๋ชจ๋ธ์ด True๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ ์ค‘์—์„œ ์‹ค์ œ True์ธ ๊ฒƒ์˜ ๋น„์œจ(=์ •๋‹ต๋ฅ )

์‹ค์ œ๋กœ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ / ๋ชจ๋ธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ

์ •๋‹ต / ๋‚ด๊ฐ€

Recall ์žฌํ˜„์œจ : ์งˆ์˜์–ด์— ๋ถ€ํ•ฉํ•˜๋Š” ๋ฌธ์„œ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฐพ์•„๋‚ด๋Š”์ง€

์‹ค์ œ True์ธ ๊ฒƒ ์ค‘์—์„œ ๋ชจ๋ธ์ด True๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์˜ ๋น„์œจ(=์ œ๊ณต๋ฅ )

์ •๋‹ต / ์‹ค์ œ

์ •ํ™•๋„(Accuracy)๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ํ‰๊ฐ€์— ์“ฐ์ด์ง€ ์•Š์Œ

์ •๋ฐ€๋„๊ฐ€ ๋†’์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋‚ฎ๊ณ , ์ •๋ฐ€๋„๊ฐ€ ๋‚ฎ์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋†’์Œ.

์ •๋ฐ€๋„๊ฐ€ 100์ธ ๊ฒฝ์šฐ : ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ 100๊ฐœ์—์„œ 1๊ฐœ ์ฐพ์Œ. → ์žฌํ˜„์œจ์€ 0.01

์žฌํ˜„์œจ์ด 100์ธ ๊ฒฝ์šฐ : ์ฐพ์€ ๋ฌธ์„œ 100๊ฐœ ์ค‘ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ชจ๋“  ๋ฌธ์„œ๊ฐ€ 1๊ฐœ์ž„ → ์ •๋ฐ€๋„๋Š” 0.01

**F-measure : ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์— ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ  ๊ตฌํ•œ ํ‰๊ท **

→ ํ”ํžˆ ์•ŒํŒŒ(๊ฐ€์ค‘์น˜)๋ฅผ 0.5๋กœ ์ฃผ์–ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ฆ

F-measure๋กœ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•จ.

์˜ˆ์‹œ)

Q1. ๊ฒ€์ƒ‰ ๋Œ€์ƒ ๋ฌธ์„œ 100๋งŒ ๊ฐœ, ๊ฐ ๋ฌธ์„œ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์žˆ์Œ. ํ•œ ๋‹จ์–ด๋Š” ํ‰๊ท ์ ์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ, ๋ถ€ํ˜ธ ํฌํ•จ 6bytes(6๊ธ€์ž). ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ๋Š”?

A1. ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ = 100๋งŒ * (1000 * 6bytes) = 6,000,000,000 = 60์–ตbytes = 6GB

Q2. 10์–ต(100๋งŒ * 1000 = 1,000,000,000)๊ฐœ ์˜ ๋‹จ์–ด ์ค‘, 500,000๊ฐœ์˜ ๋‹ค๋ฅธ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค. Term-Document Matrix์˜ ํฌ๊ธฐ๋Š”?

A2.

Term ๊ฐœ์ˆ˜ = 50๋งŒ

Document ๊ฐœ์ˆ˜ = 100๋งŒ

Term-Document Matrix ํฌ๊ธฐ = 50๋งŒ * 100๋งŒ = 5์ฒœ ์–ต

→ Boolean ๊ฒ€์ƒ‰์„ ์œ„ํ•ด Term-Document Matrix๋ฅผ ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด์•ผ ํ•˜๋Š”๋ฐ, ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํผ.

5์ฒœ ์–ต๊ฐœ์˜ 0๊ณผ 1์ค‘ 0์˜ ๋น„์ค‘์ด ๋ณดํ†ต 99.8%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1์€ 0.2%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1000๊ฐœ๋กœ ๋งŽ์ง€ ์•Š์Œ.

๋”ฐ๋ผ์„œ, 1์ธ ๊ฒƒ๋งŒ ๋”ฐ๋กœ ํ‘œ์‹œํ•˜๋„๋ก ํ•จ. ⇒ Inverted Index (Inverted file)

0์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ

term(ํ–‰) = 50๋งŒ๊ฐœ, document(์—ด) = 100๋งŒ๊ฐœ

ํ•˜๋‚˜์˜ ๋ฌธ์„œ์˜ term ์ •๋ณด = ํ•œ ์—ด(50๋งŒ๊ฐœ์˜ term), ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋Š” 1000๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋จ ⇒ ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— term ์ค‘ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ๋‚˜์˜ค๋ฏ€๋กœ 50๋งŒ ๊ฐœ์˜ ํ•œ ์—ด์—์„œ 1์€ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์ž„. 49๋งŒ9์ฒœ๊ฐœ๋Š” 0์ž„. ⇒ 50๋งŒ ๊ฐœ์—์„œ 1000๊ฐœ๊ฐ€ 1์ด์—ˆ์œผ๋‹ˆ 100๋งŒ๊ฐœ์—์„œ๋Š” 2000๊ฐœ๊ฐ€ 1์ž„. ⇒ 2์ฒœ/100๋งŒ = 0.002. 0.2%๊ฐ€ 1์ด๊ณ , 99.8%๊ฐ€ 0์ž„.

Inverted index (inverted file)

Term-Document Matrix์˜ ๊ณต๊ฐ„ ์ƒ์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด 1์ธ ๋ฌธ์„œ๋งŒ ๋ชฉ๋กํ™”ํ•จ.

๊ฒ€์ƒ‰ ๋Œ€์ƒ(๋ฌธ์„œ, ์›น, PC๊ฒฝ๋กœ)์— ๋”ฐ๋ผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Posting list์—๋Š” ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฌธ์„œ์˜ ๋ฒˆํ˜ธ ์‹๋ณ„์ž๋ฅผ ๋ถ€์—ฌํ•จ.

Posting : ๋ฌธ์„œ ๋ฒˆํ˜ธ ์‹๋ณ„์ž ํ•˜๋‚˜ํ•˜๋‚˜

Posting list : Posting์ด ๋‚˜์—ด๋œ ํ•˜๋‚˜์˜ ๋ฐฐ์—ด(๋ฆฌ์ŠคํŠธ)

Posting lists : Posting list์˜ ์ง‘ํ•ฉ

์งˆ์˜์–ด๋ฅผ ์ฐพ์„ ๋•Œ ๋” ๋นจ๋ฆฌ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ๋ฌธ์„œ๋ฅผ ์ •๋ ฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Posting list๋Š” ๋ฐฐ์—ด๋ณด๋‹ค ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํฌ์ธํ„ฐ ๊ณต๊ฐ„์„ ๋” ์ฐจ์ง€ํ•œ๋‹ค.

Dictionary == Vocabulary (Term์˜ ์ง‘ํ•ฉ)

Inverted Index ๋งŒ๋“œ๋Š” ๊ณผ์ •

[๋‹จ๊ณ„]

  1. Tokenization (ํ† ํฐํ™”)→ ๋งŒ์•ฝ, ๋ฌธ์„œ๊ฐ€ TXT๊ฐ€ ์•„๋‹Œ ์•„๋ž˜ํ•œ๊ธ€, html์ด๋ผ๋ฉด? → ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ํ† ํฐํ™” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ.1-2. Linguistic Modules๋ฅผ ํ†ตํ•ด ์–ธ์–ด์— ๋”ฐ๋ผ ์ฒ˜๋ฆฌํ•จ. (์˜์–ด : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ, ์›ํ˜•์œผ๋กœ ๋งŒ๋“ฆ)์–ด๋А ํ† ํฐ์ด ์–ด๋А ๋ฌธ์„œ์—์„œ ๋‚˜์™”๋Š”์ง€ Postings list(ํ•ด๋‹น ํ† ํฐ์˜ ๋ฌธ์„œ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„) ๊ตฌ์ถ•1-3-1. (Term, DocID) ์Œ์œผ๋กœ ๋งŒ๋“ฆ.1-3-3. ํ•„์š” ์‹œ, TF ์ •๋ณด ์ถ”๊ฐ€Term Frequency(TF) : ํ•œ ๋ฌธ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ Term์ด ๋‚˜์˜จ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
        Document Frequency(DF) : Term์ด ๋‚˜์˜จ ๋ฌธ์„œ์˜ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
    
        ์ด๋•Œ Term Frequency๋ผ๋ฆฌ ํ•ฉ์„ ๊ตฌํ•จ.
    
    DF๋กœ ํ•ฉ์ณ์ง„ ์ค‘๋ณต Term์€ ๊ฐ๊ฐ ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ์‹์œผ๋กœ ์ž์‹ ์˜ (DocID, TF)๋ฅผ ๊ฐ€๋ฆฌํ‚ด
  2. → ๋‚˜์ค‘์— ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ํ˜•์‹์˜ Inverted Index์˜ ๊ธฐํ‹€์ด ๋จ.
  3. 1-3-5. ํ•„์š” ์‹œ, DF ์ •๋ณด ์ถ”๊ฐ€
  4. ํ•œ ๋ฌธ์„œ์˜ ์ค‘๋ณต Term์€ ํ•˜๋‚˜๋กœ ํ†ต์ผ.
  5. 1-3-2. Term์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•จ.
  6. Inverted Index๋ฅผ ๋งŒ๋“ฆ.
  7. 1-3. Indexer : ํ† ํฐ์„ ์ธ๋ฑ์Šค๋กœ ๋งŒ๋“ฆ.
  8. Tokenizer ๋ชจ๋“ˆ์„ ํ†ตํ•ด ํ† ํฐํ™” ํ•จ.
  9. 1-1. ํ† ํฐํ™” : ๋ฌธ์„œ๋“ค์˜ ๋ฌธ์žฅ์„ ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ์ž๋ฆ„

์‚ฌ๋žŒ์ด ์“ฐ๋Š” ๋‹จ์–ด๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ Dictionary size๋Š” Postings list์— ๋น„ํ•ด ๋ณ€ํ™”๊ฐ€ ์ ์Œ.

๋”ฐ๋ผ์„œ, Dictionary๋Š” ๊ฒ€์ƒ‰ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•ด Main Memory๋กœ load ๋จ.

Postings๋Š” ํฌ๊ธฐ๊ฐ€ ์ ์  ์ปค์ง€๊ณ  ๋ณ€ํ™”๋„ ์žฆ์•„ Disk์— ์ €์žฅ๋จ.

Postings list ์ €์žฅ ๋ฐฉ๋ฒ•

  1. ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ
  2. ๊ฐ€๋ณ€ ๋ฐฐ์—ด
  3. hybrid scheme : ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ + ๊ฐ€๋ณ€ ๋ฐฐ์—ด (๋ฐฐ์—ด์„ ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ๋กœ ์—ฐ๊ฒฐํ•จ.)

์งˆ์˜์–ด

  1. Conjunctive : AND์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉif not sorting ์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ๊ณฑ
  2. ์กฐ๊ฑด : Posting list๊ฐ€ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด์•ผ ํ•จ.
  3. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ค๋ฅด๋ฉด ์ž‘์€ ์ชฝ์˜ Posting์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•˜๊ณ  ๋‹ค์‹œ ๋น„๊ต
  4. Disjunctive : OR์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉ
  5. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ฌ๋ผ๋„ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๋” ์ž‘์€ ์ชฝ์„ ๋จผ์ € ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  6. NOT → ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผNOT์ด ๋ถ™์€ Posting list๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, ์ž์‹ ์—๊ฒŒ ์—†๋Š” Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅธ Posting list์™€ ์—ฐ์‚ฐํ•จ.NOT์ด ๋ถ™์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ NOT์ด ๋ถ™์ง€ ์•Š์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅด๋ฉด, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค. ๊ฐ™์œผ๋ฉด ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‘ Posting ๋ชจ๋‘ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  7. <AND>
  8. NOT์ด ๋ถ™์€ ๋‹จ์–ด๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๋ฉด ์•ˆ ๋จ.

WestLaw ์‹œ์Šคํ…œ

์งˆ์˜์–ด๋ฅผ ๋งŒ๋“ค ๋•Œ, AND, OR, NOT + /์ˆซ์ž, ! ๊ฐ€๋Šฅํ•จ.

๋„์–ด์“ฐ๊ธฐ : OR

/<์ˆซ์ž> : <์ˆซ์ž> ๋‹จ์–ด ์ด๋‚ด์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ๋•Œ, AND

! : Wild-card ๋ฌธ์ž *

/s : ํ•œ ๋ฌธ์žฅ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

/p : ํ•œ ๋ฌธ๋‹จ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

Proximity operators : ์ธ์ ‘์„ฑ์„ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ์—ฐ์‚ฐ ๊ธฐํ˜ธ

์—ฐ์‚ฐ ์ˆœ์„œ ์ตœ์ ํ™”

  1. ์•ž๋ถ€ํ„ฐ ์ฒ˜๋ฆฌ
  2. ์ˆœ์„œ ๋ฐ”๊ฟ”์„œ ์ฒ˜๋ฆฌ → ์‹œ๊ฐ„ ์ ˆ์•ฝ ๊ฐ€๋ŠฅOR ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ
    1. Doc Freq๊ฐ€ ์ž‘์€ ๊ฒƒ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•จ.
    2. Posting list size == Doc Freq
    3. ์—ฐ์‚ฐ ์‹œ ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ์ตœ๋Œ€๋กœ ๊ณ„์‚ฐํ•ด์„œ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ์ž‘์€ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•จ.
  3. AND ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ

์ธ์ ‘์กฐ๊ฑด

์งˆ์˜์–ด์˜ ๋‹จ์–ด๊ฐ€ ๋„์–ด์“ฐ๊ธฐ ๋‹จ์œ„๋กœ ๋Š์–ด์ง€๋ฉด ์•ˆ ๋˜๊ณ  ๋ถ™์–ด์žˆ์–ด์•ผ ํ•  ๋•Œ

Proximity ์ธ์ ‘์กฐ๊ฑด ๋ช…๋ น์–ด : NEAR

๊ตฌ์กฐํ™”๋œ TEXT๊นŒ์ง€ ์ƒ๊ฐํ•ด์„œ ๊ฒ€์ƒ‰ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ(์ €์ž์™€ ๋‚ด์šฉ ๊ตฌ์กฐ๊ฐ€ ๋งž๋Š” ๊ฒ€์ƒ‰)

Inverted Index + Position Information

Position Information : Posting์ด ํ•ด๋‹น ๋ฌธ์„œ๊ฐ€ ๊ทธ ๋ฌธ์„œ ๋‚ด์—์„œ ์–ด๋””์— ๋‚˜์™”๋Š”์ง€ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด list๋ฅผ ๊ฐ€๋ฆฌํ‚ด

Position Information Size = Term Frequency

๋‘ ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์ฐพ๋Š” ๋ฒ• : AND ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น Term ๋‚˜์™”๋Š”์ง€ ๊ฒ€์‚ฌ → ๋‘ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋น„๊ตํ•จ → ์ธ์ ‘ ์กฐ๊ฑด์— ๋งž์„ ๊ฒฝ์šฐ ๊ฒฐ๊ณผ์— ํฌํ•จ

์ธ์ ‘ ์ฐจ์ด : 1์ด๋ฉด ๋‘ ๋‹จ์–ด๊ฐ€ ๋ถ™์–ด์žˆ์Œ์„ ์˜๋ฏธํ•˜๊ณ , 2๋ฉด ๋‘ ๋‹จ์–ด ์‚ฌ์ด์— ํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ์˜๋ฏธํ•จ.

Term Freq ํ™œ์šฉ

Term Freq๊ฐ€ ํฐ ์ˆœ์œผ๋กœ ๊ฒฐ๊ณผ ์ œ๊ณตํ•˜๊ธฐ

์–ด๋–ค Term์ด ํ•˜๋‚˜์˜ ๋ฌธ์„œ์—์„œ Document์—์„œ ๋ช‡ ๋ฒˆ ๋‚˜์™”๋Š”์ง€ ๊ณ ๋ ค

์–ด๋–ค Posting์˜ Term Freq == ๊ทธ Posting์˜ Positional Information ๋ฆฌ์ŠคํŠธ์˜ ๊ธธ์ด

๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ, ๊ฐ ๋ฌธ์„œ์˜ Term Freq์˜ ํ•ฉ์ด ํฐ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•จ.

Ranking Search

Boolean ๋ชจ๋ธ์€ ํ•ด๋‹น ์งˆ์˜์–ด๊ฐ€ ๋ฌธ์„œ์— ํฌํ•จ ๋๋Š”์ง€ ์•ˆ ๋๋Š”์ง€๋งŒ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์— Ranking์„ ๋งค๊ธฐ๋Š” ๊ฑด ์›์น™์ ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ•˜์ง€๋งŒ, ๊ธฐ์ค€์„ ์ ์šฉํ•ด์„œ ์กฐ๊ธˆ ๋” ์œ ์šฉํ•  ๋งŒํ•œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Proximity ์ธ์ ‘์„ฑ์„ ์ด์šฉํ•ด์„œ ์ธ์ ‘ํ•  ์ˆ˜๋ก ์‚ฌ์šฉ์ž๊ฐ€ ๋” ์›ํ•˜๋Š” ์ •๋ณด์— ๋” ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€๊นŒ์šด ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ์˜ Term Freq๋ฅผ ์ด์šฉํ•ด์„œ ํฐ ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ๊ฐ€ ๋งŒ๋“ค์–ด์ง„ ์‹œ๊ฐ„์„ ์ธ๋ฑ์Šค์— ๋ถ€์—ฌํ•ด์„œ ์ตœ๊ทผ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜๋‹ค.

์ •๋ณด ๊ฒ€์ƒ‰ VS DB ๊ฒ€์ƒ‰

์ •๋ณด ๊ฒ€์ƒ‰

๋น„๊ตฌ์กฐํ™” ๋จ

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๋ถˆ๊ฐ€๋Šฅ

๋‹จ, TEXT์ด๋”๋ผ๊ณ  ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ. Semi-Structured Data ์˜ˆ) PPT, XML

๊ธ€์ž์˜ ํฌ๊ธฐ, ๊ธ€์ž์˜ ๊ตต๊ธฐ์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ.

DB ๊ฒ€์ƒ‰

๊ตฌ์กฐํ™”๋จ.

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ

๊ด€๋ จ ์žˆ๋Š” Concept ๊ฒ€์ƒ‰ : ์•„์ง ๋ถˆ๊ฐ€๋Šฅ

์šฉ์–ด

Clustering : ๊ตฐ์ง‘(๋ชจ์œผ๊ธฐ)

Classfication : ๋ถ„๋ฅ˜

์›น ๊ฒ€์ƒ‰

๋‹ค์–‘ํ•œ ๋ฌธ์„œ, ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ, ์งˆ์˜์–ด, ์ •๋ณด ์กด์žฌํ•จ.

๋งํฌ๋œ ์ •๋ณด ํ™œ์šฉํ•˜๊ฑฐ๋‚˜ ํด๋ฆญํ•œ ์ •๋ณด ํ™œ์šฉ ๊ฐ€๋Šฅํ•จ.

Cross-language information retrieval : ๊ต์ฐจ ์–ธ์–ด, ๋ฒˆ์—ญํ•ด์„œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ์ œ๊ณต

Question answering : ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ ์งˆ๋ฌธ&๋‹ต๋ณ€

Summarization : ๊ฒ€์ƒ‰๊ฒฐ๊ณผ ์š”์•ฝํ•ด์„œ ์ œ๊ณต

TEXT mining : TEXT์—์„œ ํ•„์š”ํ•œ ์ •๋ณด ๋ฝ‘์•„์„œ ์•Œ๋ ค์คŒ

1996๋…„) TEXT์™€ ๊ฐ™์ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ณด๋‹ค ์ปธ์Œ. ๊ทผ๋ฐ ์‹œ์žฅ์—์„œ๋Š” ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฑด ๋ˆ์ด ๋˜์ง€ ์•Š์•˜์Œ

2006๋…„) ์—ฌ์ „ํžˆ ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ํฌ์ง€๋งŒ TEXT ๊ฒ€์ƒ‰ ์„œ๋น„์Šค ์‹œ์žฅ์ด ์ปค์ง€๋ฉด์„œ ๋น„๊ตฌ์กฐํ™” ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹œ์žฅ์ด ์„ฑ์žฅํ•จ

์ •๋ณด๊ฒ€์ƒ‰์‹œ์Šคํ…œ์ด ์—†๋‹ค๋ฉด, ์œ ๋‹‰์Šค์˜ grep๋ช…๋ น์–ด, | ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ๋ฌธ์„œ์˜ ์–‘์ด ํฌ๋‹ค๋ฉด ์†๋„๊ฐ€ ๋А๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด ๊ฒ€์ƒ‰ ๋ชฉ์ ์œผ๋กœ๋Š” ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋Š” ๊ฑด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฉด ๋” ํƒ์ƒ‰ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋‹ˆ ์‰ฌ์šด ํŽธ์ด์ง€๋งŒ, ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€ ๊ฑด ํŒŒ์ผ์„ ๋๊นŒ์ง€ ๋‹ค ํƒ์ƒ‰ํ•ด ๋ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋ ต๋‹ค. ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋Š” grep ๋ช…๋ น์œผ๋กœ ์ฐพ์„ ์ˆœ ์—†๋‹ค.

Boolean ๋ชจ๋ธ์€ ๋‹จ์–ด ํฌํ•จ ์—ฌ๋ถ€๋งŒ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค. grep ๋˜ํ•œ, Boolean ๋ชจ๋ธ์— ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค.

Boolean ๋ชจ๋ธ๋„ ์ ์ˆ˜๋ฅผ ๋งค๊ฒจ์„œ Sortํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜์ง€๋งŒ, ๊ธฐ๋ณธ์ ์œผ๋ก  ์ˆœ์œ„ ๋งค๊ธธ ์ˆ˜ ์—†๋‹ค.

$ grep -v <์งˆ์˜์–ด> * : ๋ชจ๋“  ๋ฌธ์„œ ํŒŒ์ผ(*)์—์„œ ์งˆ์˜์–ด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€(-v) ๋ฌธ์„œ๋ฅผ ํƒ์ƒ‰ํ•œ๋‹ค.

Term-document (incidence) matrix Term๊ณผ Document์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ

๊ฒ€์ƒ‰ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด, Term์ด Document์— ๋‚˜์˜ค๋ฉด 1, ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฉด 0

์งˆ์˜์–ด์™€ ์ผ์น˜ํ•˜๋Š” Term์˜ ํ–‰๋“ค์„ ์ฐพ์•„์„œ ์ฃผ์–ด์ง„ ์งˆ์˜์–ด์— ๋”ฐ๋ผ ํ–‰๋ผ๋ฆฌ AND, OR, NOT ์—ฐ์‚ฐ ์ˆ˜ํ–‰

์ด๋•Œ, NOT์€ 1๊ณผ 0์„ ๋งž๋ฐ”๊ฟˆ.

bitwise : ๊ฐ™์€ ์œ„์น˜์˜ bit๋ผ๋ฆฌ ์—ฐ์‚ฐ

์˜ˆ) ์งˆ์˜์–ด : Brutus AND Caesar but NOT Calpurnia

Brutus = 110100, Caesar = 110111, Calpurnia = 010000

์—ฐ์‚ฐ : 110100 AND 110111 AND 101111 = 100100

→ 1, 4๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ํฌํ•จ๋˜์ง€ ์•Š์Œ.

Term-document (incidence) matrix๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ๋‹ค๋ฉด, ์งˆ์˜์–ด ํ‚ค์›Œ๋“œ์— ํ•ด๋‹นํ•˜๋Š” Term์˜ ํ–‰์„ ์ฐพ์•„์„œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด๋‚ผ ์ˆ˜ ์žˆ์Œ. (Boolean ๊ฒ€์‚ฌ ๊ธฐ๋ณธ ๋ฐฉ๋ฒ•๋ก )

Boolean Model ๊ธฐ๋ณธ ๊ฐ€์ • : ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ์€ ๊ณ ์ •๋˜์–ด ์žˆ์Œ.

๋ง๋ญ‰์น˜ : Corpus, ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ

Refine : ์‚ฌ๋žŒ์ด๋‚˜ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ์ด ์งˆ์˜์–ด๋ฅผ ๋” ์ •๊ตํ•˜๊ฒŒ ์ˆ˜์ •ํ•ด์„œ ์žฌ๊ฒ€์ƒ‰์„ ํ•˜๋Š” ํ–‰์œ„

๊ฒ€์ƒ‰๋„ ์‹œ์Šคํ…œ ํ‰๊ฐ€ ๊ธฐ์ค€

Precision ์ •๋ฐ€๋„ : ๊ฒฐ๊ณผ๊ฐ€ ์งˆ์˜์–ด์™€ ์–ผ๋งˆ๋‚˜ ๋ถ€ํ•ฉํ•˜๋Š”์ง€

๋ชจ๋ธ์ด True๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ ์ค‘์—์„œ ์‹ค์ œ True์ธ ๊ฒƒ์˜ ๋น„์œจ(=์ •๋‹ต๋ฅ )

์‹ค์ œ๋กœ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ / ๋ชจ๋ธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ

์ •๋‹ต / ๋‚ด๊ฐ€

Recall ์žฌํ˜„์œจ : ์งˆ์˜์–ด์— ๋ถ€ํ•ฉํ•˜๋Š” ๋ฌธ์„œ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฐพ์•„๋‚ด๋Š”์ง€

์‹ค์ œ True์ธ ๊ฒƒ ์ค‘์—์„œ ๋ชจ๋ธ์ด True๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์˜ ๋น„์œจ(=์ œ๊ณต๋ฅ )

์ •๋‹ต / ์‹ค์ œ

์ •ํ™•๋„(Accuracy)๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ํ‰๊ฐ€์— ์“ฐ์ด์ง€ ์•Š์Œ

์ •๋ฐ€๋„๊ฐ€ ๋†’์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋‚ฎ๊ณ , ์ •๋ฐ€๋„๊ฐ€ ๋‚ฎ์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋†’์Œ.

์ •๋ฐ€๋„๊ฐ€ 100์ธ ๊ฒฝ์šฐ : ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ 100๊ฐœ์—์„œ 1๊ฐœ ์ฐพ์Œ. → ์žฌํ˜„์œจ์€ 0.01

์žฌํ˜„์œจ์ด 100์ธ ๊ฒฝ์šฐ : ์ฐพ์€ ๋ฌธ์„œ 100๊ฐœ ์ค‘ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ชจ๋“  ๋ฌธ์„œ๊ฐ€ 1๊ฐœ์ž„ → ์ •๋ฐ€๋„๋Š” 0.01

**F-measure : ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์— ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ  ๊ตฌํ•œ ํ‰๊ท **

→ ํ”ํžˆ ์•ŒํŒŒ(๊ฐ€์ค‘์น˜)๋ฅผ 0.5๋กœ ์ฃผ์–ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ฆ

F-measure๋กœ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•จ.

์˜ˆ์‹œ)

Q1. ๊ฒ€์ƒ‰ ๋Œ€์ƒ ๋ฌธ์„œ 100๋งŒ ๊ฐœ, ๊ฐ ๋ฌธ์„œ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์žˆ์Œ. ํ•œ ๋‹จ์–ด๋Š” ํ‰๊ท ์ ์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ, ๋ถ€ํ˜ธ ํฌํ•จ 6bytes(6๊ธ€์ž). ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ๋Š”?

A1. ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ = 100๋งŒ * (1000 * 6bytes) = 6,000,000,000 = 60์–ตbytes = 6GB

Q2. 10์–ต(100๋งŒ * 1000 = 1,000,000,000)๊ฐœ ์˜ ๋‹จ์–ด ์ค‘, 500,000๊ฐœ์˜ ๋‹ค๋ฅธ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค. Term-Document Matrix์˜ ํฌ๊ธฐ๋Š”?

A2.

Term ๊ฐœ์ˆ˜ = 50๋งŒ

Document ๊ฐœ์ˆ˜ = 100๋งŒ

Term-Document Matrix ํฌ๊ธฐ = 50๋งŒ * 100๋งŒ = 5์ฒœ ์–ต

→ Boolean ๊ฒ€์ƒ‰์„ ์œ„ํ•ด Term-Document Matrix๋ฅผ ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด์•ผ ํ•˜๋Š”๋ฐ, ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํผ.

5์ฒœ ์–ต๊ฐœ์˜ 0๊ณผ 1์ค‘ 0์˜ ๋น„์ค‘์ด ๋ณดํ†ต 99.8%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1์€ 0.2%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1000๊ฐœ๋กœ ๋งŽ์ง€ ์•Š์Œ.

๋”ฐ๋ผ์„œ, 1์ธ ๊ฒƒ๋งŒ ๋”ฐ๋กœ ํ‘œ์‹œํ•˜๋„๋ก ํ•จ. ⇒ Inverted Index (Inverted file)

0์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ

term(ํ–‰) = 50๋งŒ๊ฐœ, document(์—ด) = 100๋งŒ๊ฐœ

ํ•˜๋‚˜์˜ ๋ฌธ์„œ์˜ term ์ •๋ณด = ํ•œ ์—ด(50๋งŒ๊ฐœ์˜ term), ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋Š” 1000๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋จ ⇒ ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— term ์ค‘ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ๋‚˜์˜ค๋ฏ€๋กœ 50๋งŒ ๊ฐœ์˜ ํ•œ ์—ด์—์„œ 1์€ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์ž„. 49๋งŒ9์ฒœ๊ฐœ๋Š” 0์ž„. ⇒ 50๋งŒ ๊ฐœ์—์„œ 1000๊ฐœ๊ฐ€ 1์ด์—ˆ์œผ๋‹ˆ 100๋งŒ๊ฐœ์—์„œ๋Š” 2000๊ฐœ๊ฐ€ 1์ž„. ⇒ 2์ฒœ/100๋งŒ = 0.002. 0.2%๊ฐ€ 1์ด๊ณ , 99.8%๊ฐ€ 0์ž„.

Inverted index (inverted file)

Term-Document Matrix์˜ ๊ณต๊ฐ„ ์ƒ์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด 1์ธ ๋ฌธ์„œ๋งŒ ๋ชฉ๋กํ™”ํ•จ.

๊ฒ€์ƒ‰ ๋Œ€์ƒ(๋ฌธ์„œ, ์›น, PC๊ฒฝ๋กœ)์— ๋”ฐ๋ผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Posting list์—๋Š” ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฌธ์„œ์˜ ๋ฒˆํ˜ธ ์‹๋ณ„์ž๋ฅผ ๋ถ€์—ฌํ•จ.

Posting : ๋ฌธ์„œ ๋ฒˆํ˜ธ ์‹๋ณ„์ž ํ•˜๋‚˜ํ•˜๋‚˜

Posting list : Posting์ด ๋‚˜์—ด๋œ ํ•˜๋‚˜์˜ ๋ฐฐ์—ด(๋ฆฌ์ŠคํŠธ)

Posting lists : Posting list์˜ ์ง‘ํ•ฉ

์งˆ์˜์–ด๋ฅผ ์ฐพ์„ ๋•Œ ๋” ๋นจ๋ฆฌ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ๋ฌธ์„œ๋ฅผ ์ •๋ ฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Posting list๋Š” ๋ฐฐ์—ด๋ณด๋‹ค ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํฌ์ธํ„ฐ ๊ณต๊ฐ„์„ ๋” ์ฐจ์ง€ํ•œ๋‹ค.

Dictionary == Vocabulary (Term์˜ ์ง‘ํ•ฉ)

Inverted Index ๋งŒ๋“œ๋Š” ๊ณผ์ •

[๋‹จ๊ณ„]

  1. Tokenization (ํ† ํฐํ™”)→ ๋งŒ์•ฝ, ๋ฌธ์„œ๊ฐ€ TXT๊ฐ€ ์•„๋‹Œ ์•„๋ž˜ํ•œ๊ธ€, html์ด๋ผ๋ฉด? → ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ํ† ํฐํ™” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ.1-2. Linguistic Modules๋ฅผ ํ†ตํ•ด ์–ธ์–ด์— ๋”ฐ๋ผ ์ฒ˜๋ฆฌํ•จ. (์˜์–ด : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ, ์›ํ˜•์œผ๋กœ ๋งŒ๋“ฆ)์–ด๋А ํ† ํฐ์ด ์–ด๋А ๋ฌธ์„œ์—์„œ ๋‚˜์™”๋Š”์ง€ Postings list(ํ•ด๋‹น ํ† ํฐ์˜ ๋ฌธ์„œ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„) ๊ตฌ์ถ•1-3-1. (Term, DocID) ์Œ์œผ๋กœ ๋งŒ๋“ฆ.1-3-3. ํ•„์š” ์‹œ, TF ์ •๋ณด ์ถ”๊ฐ€Term Frequency(TF) : ํ•œ ๋ฌธ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ Term์ด ๋‚˜์˜จ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
        Document Frequency(DF) : Term์ด ๋‚˜์˜จ ๋ฌธ์„œ์˜ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
    
        ์ด๋•Œ Term Frequency๋ผ๋ฆฌ ํ•ฉ์„ ๊ตฌํ•จ.
    
    DF๋กœ ํ•ฉ์ณ์ง„ ์ค‘๋ณต Term์€ ๊ฐ๊ฐ ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ์‹์œผ๋กœ ์ž์‹ ์˜ (DocID, TF)๋ฅผ ๊ฐ€๋ฆฌํ‚ด
  2. → ๋‚˜์ค‘์— ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ํ˜•์‹์˜ Inverted Index์˜ ๊ธฐํ‹€์ด ๋จ.
  3. 1-3-5. ํ•„์š” ์‹œ, DF ์ •๋ณด ์ถ”๊ฐ€
  4. ํ•œ ๋ฌธ์„œ์˜ ์ค‘๋ณต Term์€ ํ•˜๋‚˜๋กœ ํ†ต์ผ.
  5. 1-3-2. Term์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•จ.
  6. Inverted Index๋ฅผ ๋งŒ๋“ฆ.
  7. 1-3. Indexer : ํ† ํฐ์„ ์ธ๋ฑ์Šค๋กœ ๋งŒ๋“ฆ.
  8. Tokenizer ๋ชจ๋“ˆ์„ ํ†ตํ•ด ํ† ํฐํ™” ํ•จ.
  9. 1-1. ํ† ํฐํ™” : ๋ฌธ์„œ๋“ค์˜ ๋ฌธ์žฅ์„ ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ์ž๋ฆ„

์‚ฌ๋žŒ์ด ์“ฐ๋Š” ๋‹จ์–ด๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ Dictionary size๋Š” Postings list์— ๋น„ํ•ด ๋ณ€ํ™”๊ฐ€ ์ ์Œ.

๋”ฐ๋ผ์„œ, Dictionary๋Š” ๊ฒ€์ƒ‰ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•ด Main Memory๋กœ load ๋จ.

Postings๋Š” ํฌ๊ธฐ๊ฐ€ ์ ์  ์ปค์ง€๊ณ  ๋ณ€ํ™”๋„ ์žฆ์•„ Disk์— ์ €์žฅ๋จ.

Postings list ์ €์žฅ ๋ฐฉ๋ฒ•

  1. ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ
  2. ๊ฐ€๋ณ€ ๋ฐฐ์—ด
  3. hybrid scheme : ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ + ๊ฐ€๋ณ€ ๋ฐฐ์—ด (๋ฐฐ์—ด์„ ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ๋กœ ์—ฐ๊ฒฐํ•จ.)

์งˆ์˜์–ด

  1. Conjunctive : AND์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉif not sorting ์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ๊ณฑ
  2. ์กฐ๊ฑด : Posting list๊ฐ€ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด์•ผ ํ•จ.
  3. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ค๋ฅด๋ฉด ์ž‘์€ ์ชฝ์˜ Posting์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•˜๊ณ  ๋‹ค์‹œ ๋น„๊ต
  4. Disjunctive : OR์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉ
  5. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ฌ๋ผ๋„ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๋” ์ž‘์€ ์ชฝ์„ ๋จผ์ € ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  6. NOT → ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผNOT์ด ๋ถ™์€ Posting list๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, ์ž์‹ ์—๊ฒŒ ์—†๋Š” Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅธ Posting list์™€ ์—ฐ์‚ฐํ•จ.NOT์ด ๋ถ™์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ NOT์ด ๋ถ™์ง€ ์•Š์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅด๋ฉด, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค. ๊ฐ™์œผ๋ฉด ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‘ Posting ๋ชจ๋‘ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  7. <AND>
  8. NOT์ด ๋ถ™์€ ๋‹จ์–ด๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๋ฉด ์•ˆ ๋จ.

WestLaw ์‹œ์Šคํ…œ

์งˆ์˜์–ด๋ฅผ ๋งŒ๋“ค ๋•Œ, AND, OR, NOT + /์ˆซ์ž, ! ๊ฐ€๋Šฅํ•จ.

๋„์–ด์“ฐ๊ธฐ : OR

/<์ˆซ์ž> : <์ˆซ์ž> ๋‹จ์–ด ์ด๋‚ด์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ๋•Œ, AND

! : Wild-card ๋ฌธ์ž *

/s : ํ•œ ๋ฌธ์žฅ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

/p : ํ•œ ๋ฌธ๋‹จ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

Proximity operators : ์ธ์ ‘์„ฑ์„ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ์—ฐ์‚ฐ ๊ธฐํ˜ธ

์—ฐ์‚ฐ ์ˆœ์„œ ์ตœ์ ํ™”

  1. ์•ž๋ถ€ํ„ฐ ์ฒ˜๋ฆฌ
  2. ์ˆœ์„œ ๋ฐ”๊ฟ”์„œ ์ฒ˜๋ฆฌ → ์‹œ๊ฐ„ ์ ˆ์•ฝ ๊ฐ€๋ŠฅOR ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ
    1. Doc Freq๊ฐ€ ์ž‘์€ ๊ฒƒ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•จ.
    2. Posting list size == Doc Freq
    3. ์—ฐ์‚ฐ ์‹œ ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ์ตœ๋Œ€๋กœ ๊ณ„์‚ฐํ•ด์„œ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ์ž‘์€ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•จ.
  3. AND ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ

์ธ์ ‘์กฐ๊ฑด

์งˆ์˜์–ด์˜ ๋‹จ์–ด๊ฐ€ ๋„์–ด์“ฐ๊ธฐ ๋‹จ์œ„๋กœ ๋Š์–ด์ง€๋ฉด ์•ˆ ๋˜๊ณ  ๋ถ™์–ด์žˆ์–ด์•ผ ํ•  ๋•Œ

Proximity ์ธ์ ‘์กฐ๊ฑด ๋ช…๋ น์–ด : NEAR

๊ตฌ์กฐํ™”๋œ TEXT๊นŒ์ง€ ์ƒ๊ฐํ•ด์„œ ๊ฒ€์ƒ‰ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ(์ €์ž์™€ ๋‚ด์šฉ ๊ตฌ์กฐ๊ฐ€ ๋งž๋Š” ๊ฒ€์ƒ‰)

Inverted Index + Position Information

Position Information : Posting์ด ํ•ด๋‹น ๋ฌธ์„œ๊ฐ€ ๊ทธ ๋ฌธ์„œ ๋‚ด์—์„œ ์–ด๋””์— ๋‚˜์™”๋Š”์ง€ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด list๋ฅผ ๊ฐ€๋ฆฌํ‚ด

Position Information Size = Term Frequency

๋‘ ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์ฐพ๋Š” ๋ฒ• : AND ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น Term ๋‚˜์™”๋Š”์ง€ ๊ฒ€์‚ฌ → ๋‘ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋น„๊ตํ•จ → ์ธ์ ‘ ์กฐ๊ฑด์— ๋งž์„ ๊ฒฝ์šฐ ๊ฒฐ๊ณผ์— ํฌํ•จ

์ธ์ ‘ ์ฐจ์ด : 1์ด๋ฉด ๋‘ ๋‹จ์–ด๊ฐ€ ๋ถ™์–ด์žˆ์Œ์„ ์˜๋ฏธํ•˜๊ณ , 2๋ฉด ๋‘ ๋‹จ์–ด ์‚ฌ์ด์— ํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ์˜๋ฏธํ•จ.

Term Freq ํ™œ์šฉ

Term Freq๊ฐ€ ํฐ ์ˆœ์œผ๋กœ ๊ฒฐ๊ณผ ์ œ๊ณตํ•˜๊ธฐ

์–ด๋–ค Term์ด ํ•˜๋‚˜์˜ ๋ฌธ์„œ์—์„œ Document์—์„œ ๋ช‡ ๋ฒˆ ๋‚˜์™”๋Š”์ง€ ๊ณ ๋ ค

์–ด๋–ค Posting์˜ Term Freq == ๊ทธ Posting์˜ Positional Information ๋ฆฌ์ŠคํŠธ์˜ ๊ธธ์ด

๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ, ๊ฐ ๋ฌธ์„œ์˜ Term Freq์˜ ํ•ฉ์ด ํฐ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•จ.

Ranking Search

Boolean ๋ชจ๋ธ์€ ํ•ด๋‹น ์งˆ์˜์–ด๊ฐ€ ๋ฌธ์„œ์— ํฌํ•จ ๋๋Š”์ง€ ์•ˆ ๋๋Š”์ง€๋งŒ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์— Ranking์„ ๋งค๊ธฐ๋Š” ๊ฑด ์›์น™์ ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ•˜์ง€๋งŒ, ๊ธฐ์ค€์„ ์ ์šฉํ•ด์„œ ์กฐ๊ธˆ ๋” ์œ ์šฉํ•  ๋งŒํ•œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Proximity ์ธ์ ‘์„ฑ์„ ์ด์šฉํ•ด์„œ ์ธ์ ‘ํ•  ์ˆ˜๋ก ์‚ฌ์šฉ์ž๊ฐ€ ๋” ์›ํ•˜๋Š” ์ •๋ณด์— ๋” ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€๊นŒ์šด ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ์˜ Term Freq๋ฅผ ์ด์šฉํ•ด์„œ ํฐ ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ๊ฐ€ ๋งŒ๋“ค์–ด์ง„ ์‹œ๊ฐ„์„ ์ธ๋ฑ์Šค์— ๋ถ€์—ฌํ•ด์„œ ์ตœ๊ทผ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜๋‹ค.

์ •๋ณด ๊ฒ€์ƒ‰ VS DB ๊ฒ€์ƒ‰

์ •๋ณด ๊ฒ€์ƒ‰

๋น„๊ตฌ์กฐํ™” ๋จ

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๋ถˆ๊ฐ€๋Šฅ

๋‹จ, TEXT์ด๋”๋ผ๊ณ  ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ. Semi-Structured Data ์˜ˆ) PPT, XML

๊ธ€์ž์˜ ํฌ๊ธฐ, ๊ธ€์ž์˜ ๊ตต๊ธฐ์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ.

DB ๊ฒ€์ƒ‰

๊ตฌ์กฐํ™”๋จ.

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ

๊ด€๋ จ ์žˆ๋Š” Concept ๊ฒ€์ƒ‰ : ์•„์ง ๋ถˆ๊ฐ€๋Šฅ

์šฉ์–ด

Clustering : ๊ตฐ์ง‘(๋ชจ์œผ๊ธฐ)

Classfication : ๋ถ„๋ฅ˜

์›น ๊ฒ€์ƒ‰

๋‹ค์–‘ํ•œ ๋ฌธ์„œ, ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ, ์งˆ์˜์–ด, ์ •๋ณด ์กด์žฌํ•จ.

๋งํฌ๋œ ์ •๋ณด ํ™œ์šฉํ•˜๊ฑฐ๋‚˜ ํด๋ฆญํ•œ ์ •๋ณด ํ™œ์šฉ ๊ฐ€๋Šฅํ•จ.

Cross-language information retrieval : ๊ต์ฐจ ์–ธ์–ด, ๋ฒˆ์—ญํ•ด์„œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ์ œ๊ณต

Question answering : ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ ์งˆ๋ฌธ&๋‹ต๋ณ€

Summarization : ๊ฒ€์ƒ‰๊ฒฐ๊ณผ ์š”์•ฝํ•ด์„œ ์ œ๊ณต

TEXT mining : TEXT์—์„œ ํ•„์š”ํ•œ ์ •๋ณด ๋ฝ‘์•„์„œ ์•Œ๋ ค์คŒ

1996๋…„) TEXT์™€ ๊ฐ™์ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ณด๋‹ค ์ปธ์Œ. ๊ทผ๋ฐ ์‹œ์žฅ์—์„œ๋Š” ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฑด ๋ˆ์ด ๋˜์ง€ ์•Š์•˜์Œ

2006๋…„) ์—ฌ์ „ํžˆ ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ํฌ์ง€๋งŒ TEXT ๊ฒ€์ƒ‰ ์„œ๋น„์Šค ์‹œ์žฅ์ด ์ปค์ง€๋ฉด์„œ ๋น„๊ตฌ์กฐํ™” ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹œ์žฅ์ด ์„ฑ์žฅํ•จ

์ •๋ณด๊ฒ€์ƒ‰์‹œ์Šคํ…œ์ด ์—†๋‹ค๋ฉด, ์œ ๋‹‰์Šค์˜ grep๋ช…๋ น์–ด, | ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ๋ฌธ์„œ์˜ ์–‘์ด ํฌ๋‹ค๋ฉด ์†๋„๊ฐ€ ๋А๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด ๊ฒ€์ƒ‰ ๋ชฉ์ ์œผ๋กœ๋Š” ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋Š” ๊ฑด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฉด ๋” ํƒ์ƒ‰ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋‹ˆ ์‰ฌ์šด ํŽธ์ด์ง€๋งŒ, ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€ ๊ฑด ํŒŒ์ผ์„ ๋๊นŒ์ง€ ๋‹ค ํƒ์ƒ‰ํ•ด ๋ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋ ต๋‹ค. ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋Š” grep ๋ช…๋ น์œผ๋กœ ์ฐพ์„ ์ˆœ ์—†๋‹ค.

Boolean ๋ชจ๋ธ์€ ๋‹จ์–ด ํฌํ•จ ์—ฌ๋ถ€๋งŒ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค. grep ๋˜ํ•œ, Boolean ๋ชจ๋ธ์— ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค.

Boolean ๋ชจ๋ธ๋„ ์ ์ˆ˜๋ฅผ ๋งค๊ฒจ์„œ Sortํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜์ง€๋งŒ, ๊ธฐ๋ณธ์ ์œผ๋ก  ์ˆœ์œ„ ๋งค๊ธธ ์ˆ˜ ์—†๋‹ค.

$ grep -v <์งˆ์˜์–ด> * : ๋ชจ๋“  ๋ฌธ์„œ ํŒŒ์ผ(*)์—์„œ ์งˆ์˜์–ด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€(-v) ๋ฌธ์„œ๋ฅผ ํƒ์ƒ‰ํ•œ๋‹ค.

Term-document (incidence) matrix Term๊ณผ Document์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ

๊ฒ€์ƒ‰ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด, Term์ด Document์— ๋‚˜์˜ค๋ฉด 1, ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฉด 0

์งˆ์˜์–ด์™€ ์ผ์น˜ํ•˜๋Š” Term์˜ ํ–‰๋“ค์„ ์ฐพ์•„์„œ ์ฃผ์–ด์ง„ ์งˆ์˜์–ด์— ๋”ฐ๋ผ ํ–‰๋ผ๋ฆฌ AND, OR, NOT ์—ฐ์‚ฐ ์ˆ˜ํ–‰

์ด๋•Œ, NOT์€ 1๊ณผ 0์„ ๋งž๋ฐ”๊ฟˆ.

bitwise : ๊ฐ™์€ ์œ„์น˜์˜ bit๋ผ๋ฆฌ ์—ฐ์‚ฐ

์˜ˆ) ์งˆ์˜์–ด : Brutus AND Caesar but NOT Calpurnia

Brutus = 110100, Caesar = 110111, Calpurnia = 010000

์—ฐ์‚ฐ : 110100 AND 110111 AND 101111 = 100100

→ 1, 4๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ํฌํ•จ๋˜์ง€ ์•Š์Œ.

Term-document (incidence) matrix๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ๋‹ค๋ฉด, ์งˆ์˜์–ด ํ‚ค์›Œ๋“œ์— ํ•ด๋‹นํ•˜๋Š” Term์˜ ํ–‰์„ ์ฐพ์•„์„œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด๋‚ผ ์ˆ˜ ์žˆ์Œ. (Boolean ๊ฒ€์‚ฌ ๊ธฐ๋ณธ ๋ฐฉ๋ฒ•๋ก )

Boolean Model ๊ธฐ๋ณธ ๊ฐ€์ • : ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ์€ ๊ณ ์ •๋˜์–ด ์žˆ์Œ.

๋ง๋ญ‰์น˜ : Corpus, ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ

Refine : ์‚ฌ๋žŒ์ด๋‚˜ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ์ด ์งˆ์˜์–ด๋ฅผ ๋” ์ •๊ตํ•˜๊ฒŒ ์ˆ˜์ •ํ•ด์„œ ์žฌ๊ฒ€์ƒ‰์„ ํ•˜๋Š” ํ–‰์œ„

๊ฒ€์ƒ‰๋„ ์‹œ์Šคํ…œ ํ‰๊ฐ€ ๊ธฐ์ค€

Precision ์ •๋ฐ€๋„ : ๊ฒฐ๊ณผ๊ฐ€ ์งˆ์˜์–ด์™€ ์–ผ๋งˆ๋‚˜ ๋ถ€ํ•ฉํ•˜๋Š”์ง€

๋ชจ๋ธ์ด True๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ ์ค‘์—์„œ ์‹ค์ œ True์ธ ๊ฒƒ์˜ ๋น„์œจ(=์ •๋‹ต๋ฅ )

์‹ค์ œ๋กœ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ / ๋ชจ๋ธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ

์ •๋‹ต / ๋‚ด๊ฐ€

Recall ์žฌํ˜„์œจ : ์งˆ์˜์–ด์— ๋ถ€ํ•ฉํ•˜๋Š” ๋ฌธ์„œ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฐพ์•„๋‚ด๋Š”์ง€

์‹ค์ œ True์ธ ๊ฒƒ ์ค‘์—์„œ ๋ชจ๋ธ์ด True๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์˜ ๋น„์œจ(=์ œ๊ณต๋ฅ )

์ •๋‹ต / ์‹ค์ œ

์ •ํ™•๋„(Accuracy)๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ํ‰๊ฐ€์— ์“ฐ์ด์ง€ ์•Š์Œ

์ •๋ฐ€๋„๊ฐ€ ๋†’์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋‚ฎ๊ณ , ์ •๋ฐ€๋„๊ฐ€ ๋‚ฎ์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋†’์Œ.

์ •๋ฐ€๋„๊ฐ€ 100์ธ ๊ฒฝ์šฐ : ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ 100๊ฐœ์—์„œ 1๊ฐœ ์ฐพ์Œ. → ์žฌํ˜„์œจ์€ 0.01

์žฌํ˜„์œจ์ด 100์ธ ๊ฒฝ์šฐ : ์ฐพ์€ ๋ฌธ์„œ 100๊ฐœ ์ค‘ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ชจ๋“  ๋ฌธ์„œ๊ฐ€ 1๊ฐœ์ž„ → ์ •๋ฐ€๋„๋Š” 0.01

**F-measure : ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์— ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ  ๊ตฌํ•œ ํ‰๊ท **

→ ํ”ํžˆ ์•ŒํŒŒ(๊ฐ€์ค‘์น˜)๋ฅผ 0.5๋กœ ์ฃผ์–ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ฆ

F-measure๋กœ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•จ.

์˜ˆ์‹œ)

Q1. ๊ฒ€์ƒ‰ ๋Œ€์ƒ ๋ฌธ์„œ 100๋งŒ ๊ฐœ, ๊ฐ ๋ฌธ์„œ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์žˆ์Œ. ํ•œ ๋‹จ์–ด๋Š” ํ‰๊ท ์ ์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ, ๋ถ€ํ˜ธ ํฌํ•จ 6bytes(6๊ธ€์ž). ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ๋Š”?

A1. ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ = 100๋งŒ * (1000 * 6bytes) = 6,000,000,000 = 60์–ตbytes = 6GB

Q2. 10์–ต(100๋งŒ * 1000 = 1,000,000,000)๊ฐœ ์˜ ๋‹จ์–ด ์ค‘, 500,000๊ฐœ์˜ ๋‹ค๋ฅธ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค. Term-Document Matrix์˜ ํฌ๊ธฐ๋Š”?

A2.

Term ๊ฐœ์ˆ˜ = 50๋งŒ

Document ๊ฐœ์ˆ˜ = 100๋งŒ

Term-Document Matrix ํฌ๊ธฐ = 50๋งŒ * 100๋งŒ = 5์ฒœ ์–ต

→ Boolean ๊ฒ€์ƒ‰์„ ์œ„ํ•ด Term-Document Matrix๋ฅผ ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด์•ผ ํ•˜๋Š”๋ฐ, ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํผ.

5์ฒœ ์–ต๊ฐœ์˜ 0๊ณผ 1์ค‘ 0์˜ ๋น„์ค‘์ด ๋ณดํ†ต 99.8%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1์€ 0.2%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1000๊ฐœ๋กœ ๋งŽ์ง€ ์•Š์Œ.

๋”ฐ๋ผ์„œ, 1์ธ ๊ฒƒ๋งŒ ๋”ฐ๋กœ ํ‘œ์‹œํ•˜๋„๋ก ํ•จ. ⇒ Inverted Index (Inverted file)

0์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ

term(ํ–‰) = 50๋งŒ๊ฐœ, document(์—ด) = 100๋งŒ๊ฐœ

ํ•˜๋‚˜์˜ ๋ฌธ์„œ์˜ term ์ •๋ณด = ํ•œ ์—ด(50๋งŒ๊ฐœ์˜ term), ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋Š” 1000๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋จ ⇒ ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— term ์ค‘ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ๋‚˜์˜ค๋ฏ€๋กœ 50๋งŒ ๊ฐœ์˜ ํ•œ ์—ด์—์„œ 1์€ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์ž„. 49๋งŒ9์ฒœ๊ฐœ๋Š” 0์ž„. ⇒ 50๋งŒ ๊ฐœ์—์„œ 1000๊ฐœ๊ฐ€ 1์ด์—ˆ์œผ๋‹ˆ 100๋งŒ๊ฐœ์—์„œ๋Š” 2000๊ฐœ๊ฐ€ 1์ž„. ⇒ 2์ฒœ/100๋งŒ = 0.002. 0.2%๊ฐ€ 1์ด๊ณ , 99.8%๊ฐ€ 0์ž„.

Inverted index (inverted file)

Term-Document Matrix์˜ ๊ณต๊ฐ„ ์ƒ์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด 1์ธ ๋ฌธ์„œ๋งŒ ๋ชฉ๋กํ™”ํ•จ.

๊ฒ€์ƒ‰ ๋Œ€์ƒ(๋ฌธ์„œ, ์›น, PC๊ฒฝ๋กœ)์— ๋”ฐ๋ผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Posting list์—๋Š” ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฌธ์„œ์˜ ๋ฒˆํ˜ธ ์‹๋ณ„์ž๋ฅผ ๋ถ€์—ฌํ•จ.

Posting : ๋ฌธ์„œ ๋ฒˆํ˜ธ ์‹๋ณ„์ž ํ•˜๋‚˜ํ•˜๋‚˜

Posting list : Posting์ด ๋‚˜์—ด๋œ ํ•˜๋‚˜์˜ ๋ฐฐ์—ด(๋ฆฌ์ŠคํŠธ)

Posting lists : Posting list์˜ ์ง‘ํ•ฉ

์งˆ์˜์–ด๋ฅผ ์ฐพ์„ ๋•Œ ๋” ๋นจ๋ฆฌ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ๋ฌธ์„œ๋ฅผ ์ •๋ ฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Posting list๋Š” ๋ฐฐ์—ด๋ณด๋‹ค ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํฌ์ธํ„ฐ ๊ณต๊ฐ„์„ ๋” ์ฐจ์ง€ํ•œ๋‹ค.

Dictionary == Vocabulary (Term์˜ ์ง‘ํ•ฉ)

Inverted Index ๋งŒ๋“œ๋Š” ๊ณผ์ •

[๋‹จ๊ณ„]

  1. Tokenization (ํ† ํฐํ™”)→ ๋งŒ์•ฝ, ๋ฌธ์„œ๊ฐ€ TXT๊ฐ€ ์•„๋‹Œ ์•„๋ž˜ํ•œ๊ธ€, html์ด๋ผ๋ฉด? → ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ํ† ํฐํ™” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ.1-2. Linguistic Modules๋ฅผ ํ†ตํ•ด ์–ธ์–ด์— ๋”ฐ๋ผ ์ฒ˜๋ฆฌํ•จ. (์˜์–ด : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ, ์›ํ˜•์œผ๋กœ ๋งŒ๋“ฆ)์–ด๋А ํ† ํฐ์ด ์–ด๋А ๋ฌธ์„œ์—์„œ ๋‚˜์™”๋Š”์ง€ Postings list(ํ•ด๋‹น ํ† ํฐ์˜ ๋ฌธ์„œ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„) ๊ตฌ์ถ•1-3-1. (Term, DocID) ์Œ์œผ๋กœ ๋งŒ๋“ฆ.1-3-3. ํ•„์š” ์‹œ, TF ์ •๋ณด ์ถ”๊ฐ€Term Frequency(TF) : ํ•œ ๋ฌธ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ Term์ด ๋‚˜์˜จ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
        Document Frequency(DF) : Term์ด ๋‚˜์˜จ ๋ฌธ์„œ์˜ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
    
        ์ด๋•Œ Term Frequency๋ผ๋ฆฌ ํ•ฉ์„ ๊ตฌํ•จ.
    
    DF๋กœ ํ•ฉ์ณ์ง„ ์ค‘๋ณต Term์€ ๊ฐ๊ฐ ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ์‹์œผ๋กœ ์ž์‹ ์˜ (DocID, TF)๋ฅผ ๊ฐ€๋ฆฌํ‚ด
  2. → ๋‚˜์ค‘์— ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ํ˜•์‹์˜ Inverted Index์˜ ๊ธฐํ‹€์ด ๋จ.
  3. 1-3-5. ํ•„์š” ์‹œ, DF ์ •๋ณด ์ถ”๊ฐ€
  4. ํ•œ ๋ฌธ์„œ์˜ ์ค‘๋ณต Term์€ ํ•˜๋‚˜๋กœ ํ†ต์ผ.
  5. 1-3-2. Term์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•จ.
  6. Inverted Index๋ฅผ ๋งŒ๋“ฆ.
  7. 1-3. Indexer : ํ† ํฐ์„ ์ธ๋ฑ์Šค๋กœ ๋งŒ๋“ฆ.
  8. Tokenizer ๋ชจ๋“ˆ์„ ํ†ตํ•ด ํ† ํฐํ™” ํ•จ.
  9. 1-1. ํ† ํฐํ™” : ๋ฌธ์„œ๋“ค์˜ ๋ฌธ์žฅ์„ ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ์ž๋ฆ„

์‚ฌ๋žŒ์ด ์“ฐ๋Š” ๋‹จ์–ด๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ Dictionary size๋Š” Postings list์— ๋น„ํ•ด ๋ณ€ํ™”๊ฐ€ ์ ์Œ.

๋”ฐ๋ผ์„œ, Dictionary๋Š” ๊ฒ€์ƒ‰ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•ด Main Memory๋กœ load ๋จ.

Postings๋Š” ํฌ๊ธฐ๊ฐ€ ์ ์  ์ปค์ง€๊ณ  ๋ณ€ํ™”๋„ ์žฆ์•„ Disk์— ์ €์žฅ๋จ.

Postings list ์ €์žฅ ๋ฐฉ๋ฒ•

  1. ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ
  2. ๊ฐ€๋ณ€ ๋ฐฐ์—ด
  3. hybrid scheme : ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ + ๊ฐ€๋ณ€ ๋ฐฐ์—ด (๋ฐฐ์—ด์„ ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ๋กœ ์—ฐ๊ฒฐํ•จ.)

์งˆ์˜์–ด

  1. Conjunctive : AND์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉif not sorting ์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ๊ณฑ
  2. ์กฐ๊ฑด : Posting list๊ฐ€ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด์•ผ ํ•จ.
  3. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ค๋ฅด๋ฉด ์ž‘์€ ์ชฝ์˜ Posting์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•˜๊ณ  ๋‹ค์‹œ ๋น„๊ต
  4. Disjunctive : OR์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉ
  5. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ฌ๋ผ๋„ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๋” ์ž‘์€ ์ชฝ์„ ๋จผ์ € ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  6. NOT → ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผNOT์ด ๋ถ™์€ Posting list๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, ์ž์‹ ์—๊ฒŒ ์—†๋Š” Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅธ Posting list์™€ ์—ฐ์‚ฐํ•จ.NOT์ด ๋ถ™์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ NOT์ด ๋ถ™์ง€ ์•Š์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅด๋ฉด, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค. ๊ฐ™์œผ๋ฉด ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‘ Posting ๋ชจ๋‘ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  7. <AND>
  8. NOT์ด ๋ถ™์€ ๋‹จ์–ด๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๋ฉด ์•ˆ ๋จ.

WestLaw ์‹œ์Šคํ…œ

์งˆ์˜์–ด๋ฅผ ๋งŒ๋“ค ๋•Œ, AND, OR, NOT + /์ˆซ์ž, ! ๊ฐ€๋Šฅํ•จ.

๋„์–ด์“ฐ๊ธฐ : OR

/<์ˆซ์ž> : <์ˆซ์ž> ๋‹จ์–ด ์ด๋‚ด์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ๋•Œ, AND

! : Wild-card ๋ฌธ์ž *

/s : ํ•œ ๋ฌธ์žฅ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

/p : ํ•œ ๋ฌธ๋‹จ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

Proximity operators : ์ธ์ ‘์„ฑ์„ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ์—ฐ์‚ฐ ๊ธฐํ˜ธ

์—ฐ์‚ฐ ์ˆœ์„œ ์ตœ์ ํ™”

  1. ์•ž๋ถ€ํ„ฐ ์ฒ˜๋ฆฌ
  2. ์ˆœ์„œ ๋ฐ”๊ฟ”์„œ ์ฒ˜๋ฆฌ → ์‹œ๊ฐ„ ์ ˆ์•ฝ ๊ฐ€๋ŠฅOR ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ
    1. Doc Freq๊ฐ€ ์ž‘์€ ๊ฒƒ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•จ.
    2. Posting list size == Doc Freq
    3. ์—ฐ์‚ฐ ์‹œ ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ์ตœ๋Œ€๋กœ ๊ณ„์‚ฐํ•ด์„œ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ์ž‘์€ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•จ.
  3. AND ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ

์ธ์ ‘์กฐ๊ฑด

์งˆ์˜์–ด์˜ ๋‹จ์–ด๊ฐ€ ๋„์–ด์“ฐ๊ธฐ ๋‹จ์œ„๋กœ ๋Š์–ด์ง€๋ฉด ์•ˆ ๋˜๊ณ  ๋ถ™์–ด์žˆ์–ด์•ผ ํ•  ๋•Œ

Proximity ์ธ์ ‘์กฐ๊ฑด ๋ช…๋ น์–ด : NEAR

๊ตฌ์กฐํ™”๋œ TEXT๊นŒ์ง€ ์ƒ๊ฐํ•ด์„œ ๊ฒ€์ƒ‰ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ(์ €์ž์™€ ๋‚ด์šฉ ๊ตฌ์กฐ๊ฐ€ ๋งž๋Š” ๊ฒ€์ƒ‰)

Inverted Index + Position Information

Position Information : Posting์ด ํ•ด๋‹น ๋ฌธ์„œ๊ฐ€ ๊ทธ ๋ฌธ์„œ ๋‚ด์—์„œ ์–ด๋””์— ๋‚˜์™”๋Š”์ง€ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด list๋ฅผ ๊ฐ€๋ฆฌํ‚ด

Position Information Size = Term Frequency

๋‘ ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์ฐพ๋Š” ๋ฒ• : AND ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น Term ๋‚˜์™”๋Š”์ง€ ๊ฒ€์‚ฌ → ๋‘ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋น„๊ตํ•จ → ์ธ์ ‘ ์กฐ๊ฑด์— ๋งž์„ ๊ฒฝ์šฐ ๊ฒฐ๊ณผ์— ํฌํ•จ

์ธ์ ‘ ์ฐจ์ด : 1์ด๋ฉด ๋‘ ๋‹จ์–ด๊ฐ€ ๋ถ™์–ด์žˆ์Œ์„ ์˜๋ฏธํ•˜๊ณ , 2๋ฉด ๋‘ ๋‹จ์–ด ์‚ฌ์ด์— ํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ์˜๋ฏธํ•จ.

Term Freq ํ™œ์šฉ

Term Freq๊ฐ€ ํฐ ์ˆœ์œผ๋กœ ๊ฒฐ๊ณผ ์ œ๊ณตํ•˜๊ธฐ

์–ด๋–ค Term์ด ํ•˜๋‚˜์˜ ๋ฌธ์„œ์—์„œ Document์—์„œ ๋ช‡ ๋ฒˆ ๋‚˜์™”๋Š”์ง€ ๊ณ ๋ ค

์–ด๋–ค Posting์˜ Term Freq == ๊ทธ Posting์˜ Positional Information ๋ฆฌ์ŠคํŠธ์˜ ๊ธธ์ด

๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ, ๊ฐ ๋ฌธ์„œ์˜ Term Freq์˜ ํ•ฉ์ด ํฐ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•จ.

Ranking Search

Boolean ๋ชจ๋ธ์€ ํ•ด๋‹น ์งˆ์˜์–ด๊ฐ€ ๋ฌธ์„œ์— ํฌํ•จ ๋๋Š”์ง€ ์•ˆ ๋๋Š”์ง€๋งŒ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์— Ranking์„ ๋งค๊ธฐ๋Š” ๊ฑด ์›์น™์ ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ•˜์ง€๋งŒ, ๊ธฐ์ค€์„ ์ ์šฉํ•ด์„œ ์กฐ๊ธˆ ๋” ์œ ์šฉํ•  ๋งŒํ•œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Proximity ์ธ์ ‘์„ฑ์„ ์ด์šฉํ•ด์„œ ์ธ์ ‘ํ•  ์ˆ˜๋ก ์‚ฌ์šฉ์ž๊ฐ€ ๋” ์›ํ•˜๋Š” ์ •๋ณด์— ๋” ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€๊นŒ์šด ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ์˜ Term Freq๋ฅผ ์ด์šฉํ•ด์„œ ํฐ ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ๊ฐ€ ๋งŒ๋“ค์–ด์ง„ ์‹œ๊ฐ„์„ ์ธ๋ฑ์Šค์— ๋ถ€์—ฌํ•ด์„œ ์ตœ๊ทผ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜๋‹ค.

์ •๋ณด ๊ฒ€์ƒ‰ VS DB ๊ฒ€์ƒ‰

์ •๋ณด ๊ฒ€์ƒ‰

๋น„๊ตฌ์กฐํ™” ๋จ

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๋ถˆ๊ฐ€๋Šฅ

๋‹จ, TEXT์ด๋”๋ผ๊ณ  ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ. Semi-Structured Data ์˜ˆ) PPT, XML

๊ธ€์ž์˜ ํฌ๊ธฐ, ๊ธ€์ž์˜ ๊ตต๊ธฐ์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ.

DB ๊ฒ€์ƒ‰

๊ตฌ์กฐํ™”๋จ.

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ

๊ด€๋ จ ์žˆ๋Š” Concept ๊ฒ€์ƒ‰ : ์•„์ง ๋ถˆ๊ฐ€๋Šฅ

์šฉ์–ด

Clustering : ๊ตฐ์ง‘(๋ชจ์œผ๊ธฐ)

Classfication : ๋ถ„๋ฅ˜

์›น ๊ฒ€์ƒ‰

๋‹ค์–‘ํ•œ ๋ฌธ์„œ, ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ, ์งˆ์˜์–ด, ์ •๋ณด ์กด์žฌํ•จ.

๋งํฌ๋œ ์ •๋ณด ํ™œ์šฉํ•˜๊ฑฐ๋‚˜ ํด๋ฆญํ•œ ์ •๋ณด ํ™œ์šฉ ๊ฐ€๋Šฅํ•จ.

Cross-language information retrieval : ๊ต์ฐจ ์–ธ์–ด, ๋ฒˆ์—ญํ•ด์„œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ์ œ๊ณต

Question answering : ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ ์งˆ๋ฌธ&๋‹ต๋ณ€

Summarization : ๊ฒ€์ƒ‰๊ฒฐ๊ณผ ์š”์•ฝํ•ด์„œ ์ œ๊ณต

TEXT mining : TEXT์—์„œ ํ•„์š”ํ•œ ์ •๋ณด ๋ฝ‘์•„์„œ ์•Œ๋ ค์คŒ

1996๋…„) TEXT์™€ ๊ฐ™์ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ณด๋‹ค ์ปธ์Œ. ๊ทผ๋ฐ ์‹œ์žฅ์—์„œ๋Š” ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฑด ๋ˆ์ด ๋˜์ง€ ์•Š์•˜์Œ

2006๋…„) ์—ฌ์ „ํžˆ ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ํฌ์ง€๋งŒ TEXT ๊ฒ€์ƒ‰ ์„œ๋น„์Šค ์‹œ์žฅ์ด ์ปค์ง€๋ฉด์„œ ๋น„๊ตฌ์กฐํ™” ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹œ์žฅ์ด ์„ฑ์žฅํ•จ

์ •๋ณด๊ฒ€์ƒ‰์‹œ์Šคํ…œ์ด ์—†๋‹ค๋ฉด, ์œ ๋‹‰์Šค์˜ grep๋ช…๋ น์–ด, | ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ๋ฌธ์„œ์˜ ์–‘์ด ํฌ๋‹ค๋ฉด ์†๋„๊ฐ€ ๋А๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด ๊ฒ€์ƒ‰ ๋ชฉ์ ์œผ๋กœ๋Š” ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋Š” ๊ฑด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฉด ๋” ํƒ์ƒ‰ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋‹ˆ ์‰ฌ์šด ํŽธ์ด์ง€๋งŒ, ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€ ๊ฑด ํŒŒ์ผ์„ ๋๊นŒ์ง€ ๋‹ค ํƒ์ƒ‰ํ•ด ๋ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋ ต๋‹ค. ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋Š” grep ๋ช…๋ น์œผ๋กœ ์ฐพ์„ ์ˆœ ์—†๋‹ค.

Boolean ๋ชจ๋ธ์€ ๋‹จ์–ด ํฌํ•จ ์—ฌ๋ถ€๋งŒ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค. grep ๋˜ํ•œ, Boolean ๋ชจ๋ธ์— ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค.

Boolean ๋ชจ๋ธ๋„ ์ ์ˆ˜๋ฅผ ๋งค๊ฒจ์„œ Sortํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜์ง€๋งŒ, ๊ธฐ๋ณธ์ ์œผ๋ก  ์ˆœ์œ„ ๋งค๊ธธ ์ˆ˜ ์—†๋‹ค.

$ grep -v <์งˆ์˜์–ด> * : ๋ชจ๋“  ๋ฌธ์„œ ํŒŒ์ผ(*)์—์„œ ์งˆ์˜์–ด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€(-v) ๋ฌธ์„œ๋ฅผ ํƒ์ƒ‰ํ•œ๋‹ค.

Term-document (incidence) matrix Term๊ณผ Document์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ

๊ฒ€์ƒ‰ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด, Term์ด Document์— ๋‚˜์˜ค๋ฉด 1, ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฉด 0

์งˆ์˜์–ด์™€ ์ผ์น˜ํ•˜๋Š” Term์˜ ํ–‰๋“ค์„ ์ฐพ์•„์„œ ์ฃผ์–ด์ง„ ์งˆ์˜์–ด์— ๋”ฐ๋ผ ํ–‰๋ผ๋ฆฌ AND, OR, NOT ์—ฐ์‚ฐ ์ˆ˜ํ–‰

์ด๋•Œ, NOT์€ 1๊ณผ 0์„ ๋งž๋ฐ”๊ฟˆ.

bitwise : ๊ฐ™์€ ์œ„์น˜์˜ bit๋ผ๋ฆฌ ์—ฐ์‚ฐ

์˜ˆ) ์งˆ์˜์–ด : Brutus AND Caesar but NOT Calpurnia

Brutus = 110100, Caesar = 110111, Calpurnia = 010000

์—ฐ์‚ฐ : 110100 AND 110111 AND 101111 = 100100

→ 1, 4๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ํฌํ•จ๋˜์ง€ ์•Š์Œ.

Term-document (incidence) matrix๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ๋‹ค๋ฉด, ์งˆ์˜์–ด ํ‚ค์›Œ๋“œ์— ํ•ด๋‹นํ•˜๋Š” Term์˜ ํ–‰์„ ์ฐพ์•„์„œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด๋‚ผ ์ˆ˜ ์žˆ์Œ. (Boolean ๊ฒ€์‚ฌ ๊ธฐ๋ณธ ๋ฐฉ๋ฒ•๋ก )

Boolean Model ๊ธฐ๋ณธ ๊ฐ€์ • : ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ์€ ๊ณ ์ •๋˜์–ด ์žˆ์Œ.

๋ง๋ญ‰์น˜ : Corpus, ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ

Refine : ์‚ฌ๋žŒ์ด๋‚˜ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ์ด ์งˆ์˜์–ด๋ฅผ ๋” ์ •๊ตํ•˜๊ฒŒ ์ˆ˜์ •ํ•ด์„œ ์žฌ๊ฒ€์ƒ‰์„ ํ•˜๋Š” ํ–‰์œ„

๊ฒ€์ƒ‰๋„ ์‹œ์Šคํ…œ ํ‰๊ฐ€ ๊ธฐ์ค€

Precision ์ •๋ฐ€๋„ : ๊ฒฐ๊ณผ๊ฐ€ ์งˆ์˜์–ด์™€ ์–ผ๋งˆ๋‚˜ ๋ถ€ํ•ฉํ•˜๋Š”์ง€

๋ชจ๋ธ์ด True๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ ์ค‘์—์„œ ์‹ค์ œ True์ธ ๊ฒƒ์˜ ๋น„์œจ(=์ •๋‹ต๋ฅ )

์‹ค์ œ๋กœ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ / ๋ชจ๋ธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ

์ •๋‹ต / ๋‚ด๊ฐ€

Recall ์žฌํ˜„์œจ : ์งˆ์˜์–ด์— ๋ถ€ํ•ฉํ•˜๋Š” ๋ฌธ์„œ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฐพ์•„๋‚ด๋Š”์ง€

์‹ค์ œ True์ธ ๊ฒƒ ์ค‘์—์„œ ๋ชจ๋ธ์ด True๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์˜ ๋น„์œจ(=์ œ๊ณต๋ฅ )

์ •๋‹ต / ์‹ค์ œ

์ •ํ™•๋„(Accuracy)๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ํ‰๊ฐ€์— ์“ฐ์ด์ง€ ์•Š์Œ

์ •๋ฐ€๋„๊ฐ€ ๋†’์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋‚ฎ๊ณ , ์ •๋ฐ€๋„๊ฐ€ ๋‚ฎ์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋†’์Œ.

์ •๋ฐ€๋„๊ฐ€ 100์ธ ๊ฒฝ์šฐ : ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ 100๊ฐœ์—์„œ 1๊ฐœ ์ฐพ์Œ. → ์žฌํ˜„์œจ์€ 0.01

์žฌํ˜„์œจ์ด 100์ธ ๊ฒฝ์šฐ : ์ฐพ์€ ๋ฌธ์„œ 100๊ฐœ ์ค‘ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ชจ๋“  ๋ฌธ์„œ๊ฐ€ 1๊ฐœ์ž„ → ์ •๋ฐ€๋„๋Š” 0.01

**F-measure : ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์— ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ  ๊ตฌํ•œ ํ‰๊ท **

→ ํ”ํžˆ ์•ŒํŒŒ(๊ฐ€์ค‘์น˜)๋ฅผ 0.5๋กœ ์ฃผ์–ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ฆ

F-measure๋กœ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•จ.

์˜ˆ์‹œ)

Q1. ๊ฒ€์ƒ‰ ๋Œ€์ƒ ๋ฌธ์„œ 100๋งŒ ๊ฐœ, ๊ฐ ๋ฌธ์„œ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์žˆ์Œ. ํ•œ ๋‹จ์–ด๋Š” ํ‰๊ท ์ ์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ, ๋ถ€ํ˜ธ ํฌํ•จ 6bytes(6๊ธ€์ž). ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ๋Š”?

A1. ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ = 100๋งŒ * (1000 * 6bytes) = 6,000,000,000 = 60์–ตbytes = 6GB

Q2. 10์–ต(100๋งŒ * 1000 = 1,000,000,000)๊ฐœ ์˜ ๋‹จ์–ด ์ค‘, 500,000๊ฐœ์˜ ๋‹ค๋ฅธ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค. Term-Document Matrix์˜ ํฌ๊ธฐ๋Š”?

A2.

Term ๊ฐœ์ˆ˜ = 50๋งŒ

Document ๊ฐœ์ˆ˜ = 100๋งŒ

Term-Document Matrix ํฌ๊ธฐ = 50๋งŒ * 100๋งŒ = 5์ฒœ ์–ต

→ Boolean ๊ฒ€์ƒ‰์„ ์œ„ํ•ด Term-Document Matrix๋ฅผ ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด์•ผ ํ•˜๋Š”๋ฐ, ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํผ.

5์ฒœ ์–ต๊ฐœ์˜ 0๊ณผ 1์ค‘ 0์˜ ๋น„์ค‘์ด ๋ณดํ†ต 99.8%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1์€ 0.2%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1000๊ฐœ๋กœ ๋งŽ์ง€ ์•Š์Œ.

๋”ฐ๋ผ์„œ, 1์ธ ๊ฒƒ๋งŒ ๋”ฐ๋กœ ํ‘œ์‹œํ•˜๋„๋ก ํ•จ. ⇒ Inverted Index (Inverted file)

0์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ

term(ํ–‰) = 50๋งŒ๊ฐœ, document(์—ด) = 100๋งŒ๊ฐœ

ํ•˜๋‚˜์˜ ๋ฌธ์„œ์˜ term ์ •๋ณด = ํ•œ ์—ด(50๋งŒ๊ฐœ์˜ term), ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋Š” 1000๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋จ ⇒ ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— term ์ค‘ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ๋‚˜์˜ค๋ฏ€๋กœ 50๋งŒ ๊ฐœ์˜ ํ•œ ์—ด์—์„œ 1์€ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์ž„. 49๋งŒ9์ฒœ๊ฐœ๋Š” 0์ž„. ⇒ 50๋งŒ ๊ฐœ์—์„œ 1000๊ฐœ๊ฐ€ 1์ด์—ˆ์œผ๋‹ˆ 100๋งŒ๊ฐœ์—์„œ๋Š” 2000๊ฐœ๊ฐ€ 1์ž„. ⇒ 2์ฒœ/100๋งŒ = 0.002. 0.2%๊ฐ€ 1์ด๊ณ , 99.8%๊ฐ€ 0์ž„.

Inverted index (inverted file)

Term-Document Matrix์˜ ๊ณต๊ฐ„ ์ƒ์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด 1์ธ ๋ฌธ์„œ๋งŒ ๋ชฉ๋กํ™”ํ•จ.

๊ฒ€์ƒ‰ ๋Œ€์ƒ(๋ฌธ์„œ, ์›น, PC๊ฒฝ๋กœ)์— ๋”ฐ๋ผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Posting list์—๋Š” ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฌธ์„œ์˜ ๋ฒˆํ˜ธ ์‹๋ณ„์ž๋ฅผ ๋ถ€์—ฌํ•จ.

Posting : ๋ฌธ์„œ ๋ฒˆํ˜ธ ์‹๋ณ„์ž ํ•˜๋‚˜ํ•˜๋‚˜

Posting list : Posting์ด ๋‚˜์—ด๋œ ํ•˜๋‚˜์˜ ๋ฐฐ์—ด(๋ฆฌ์ŠคํŠธ)

Posting lists : Posting list์˜ ์ง‘ํ•ฉ

์งˆ์˜์–ด๋ฅผ ์ฐพ์„ ๋•Œ ๋” ๋นจ๋ฆฌ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ๋ฌธ์„œ๋ฅผ ์ •๋ ฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Posting list๋Š” ๋ฐฐ์—ด๋ณด๋‹ค ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํฌ์ธํ„ฐ ๊ณต๊ฐ„์„ ๋” ์ฐจ์ง€ํ•œ๋‹ค.

Dictionary == Vocabulary (Term์˜ ์ง‘ํ•ฉ)

Inverted Index ๋งŒ๋“œ๋Š” ๊ณผ์ •

[๋‹จ๊ณ„]

  1. Tokenization (ํ† ํฐํ™”)→ ๋งŒ์•ฝ, ๋ฌธ์„œ๊ฐ€ TXT๊ฐ€ ์•„๋‹Œ ์•„๋ž˜ํ•œ๊ธ€, html์ด๋ผ๋ฉด? → ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ํ† ํฐํ™” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ.1-2. Linguistic Modules๋ฅผ ํ†ตํ•ด ์–ธ์–ด์— ๋”ฐ๋ผ ์ฒ˜๋ฆฌํ•จ. (์˜์–ด : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ, ์›ํ˜•์œผ๋กœ ๋งŒ๋“ฆ)์–ด๋А ํ† ํฐ์ด ์–ด๋А ๋ฌธ์„œ์—์„œ ๋‚˜์™”๋Š”์ง€ Postings list(ํ•ด๋‹น ํ† ํฐ์˜ ๋ฌธ์„œ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„) ๊ตฌ์ถ•1-3-1. (Term, DocID) ์Œ์œผ๋กœ ๋งŒ๋“ฆ.1-3-3. ํ•„์š” ์‹œ, TF ์ •๋ณด ์ถ”๊ฐ€Term Frequency(TF) : ํ•œ ๋ฌธ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ Term์ด ๋‚˜์˜จ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
        Document Frequency(DF) : Term์ด ๋‚˜์˜จ ๋ฌธ์„œ์˜ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
    
        ์ด๋•Œ Term Frequency๋ผ๋ฆฌ ํ•ฉ์„ ๊ตฌํ•จ.
    
    DF๋กœ ํ•ฉ์ณ์ง„ ์ค‘๋ณต Term์€ ๊ฐ๊ฐ ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ์‹์œผ๋กœ ์ž์‹ ์˜ (DocID, TF)๋ฅผ ๊ฐ€๋ฆฌํ‚ด
  2. → ๋‚˜์ค‘์— ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ํ˜•์‹์˜ Inverted Index์˜ ๊ธฐํ‹€์ด ๋จ.
  3. 1-3-5. ํ•„์š” ์‹œ, DF ์ •๋ณด ์ถ”๊ฐ€
  4. ํ•œ ๋ฌธ์„œ์˜ ์ค‘๋ณต Term์€ ํ•˜๋‚˜๋กœ ํ†ต์ผ.
  5. 1-3-2. Term์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•จ.
  6. Inverted Index๋ฅผ ๋งŒ๋“ฆ.
  7. 1-3. Indexer : ํ† ํฐ์„ ์ธ๋ฑ์Šค๋กœ ๋งŒ๋“ฆ.
  8. Tokenizer ๋ชจ๋“ˆ์„ ํ†ตํ•ด ํ† ํฐํ™” ํ•จ.
  9. 1-1. ํ† ํฐํ™” : ๋ฌธ์„œ๋“ค์˜ ๋ฌธ์žฅ์„ ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ์ž๋ฆ„

์‚ฌ๋žŒ์ด ์“ฐ๋Š” ๋‹จ์–ด๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ Dictionary size๋Š” Postings list์— ๋น„ํ•ด ๋ณ€ํ™”๊ฐ€ ์ ์Œ.

๋”ฐ๋ผ์„œ, Dictionary๋Š” ๊ฒ€์ƒ‰ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•ด Main Memory๋กœ load ๋จ.

Postings๋Š” ํฌ๊ธฐ๊ฐ€ ์ ์  ์ปค์ง€๊ณ  ๋ณ€ํ™”๋„ ์žฆ์•„ Disk์— ์ €์žฅ๋จ.

Postings list ์ €์žฅ ๋ฐฉ๋ฒ•

  1. ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ
  2. ๊ฐ€๋ณ€ ๋ฐฐ์—ด
  3. hybrid scheme : ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ + ๊ฐ€๋ณ€ ๋ฐฐ์—ด (๋ฐฐ์—ด์„ ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ๋กœ ์—ฐ๊ฒฐํ•จ.)

์งˆ์˜์–ด

  1. Conjunctive : AND์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉif not sorting ์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ๊ณฑ
  2. ์กฐ๊ฑด : Posting list๊ฐ€ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด์•ผ ํ•จ.
  3. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ค๋ฅด๋ฉด ์ž‘์€ ์ชฝ์˜ Posting์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•˜๊ณ  ๋‹ค์‹œ ๋น„๊ต
  4. Disjunctive : OR์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉ
  5. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ฌ๋ผ๋„ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๋” ์ž‘์€ ์ชฝ์„ ๋จผ์ € ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  6. NOT → ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผNOT์ด ๋ถ™์€ Posting list๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, ์ž์‹ ์—๊ฒŒ ์—†๋Š” Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅธ Posting list์™€ ์—ฐ์‚ฐํ•จ.NOT์ด ๋ถ™์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ NOT์ด ๋ถ™์ง€ ์•Š์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅด๋ฉด, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค. ๊ฐ™์œผ๋ฉด ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‘ Posting ๋ชจ๋‘ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  7. <AND>
  8. NOT์ด ๋ถ™์€ ๋‹จ์–ด๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๋ฉด ์•ˆ ๋จ.

WestLaw ์‹œ์Šคํ…œ

์งˆ์˜์–ด๋ฅผ ๋งŒ๋“ค ๋•Œ, AND, OR, NOT + /์ˆซ์ž, ! ๊ฐ€๋Šฅํ•จ.

๋„์–ด์“ฐ๊ธฐ : OR

/<์ˆซ์ž> : <์ˆซ์ž> ๋‹จ์–ด ์ด๋‚ด์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ๋•Œ, AND

! : Wild-card ๋ฌธ์ž *

/s : ํ•œ ๋ฌธ์žฅ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

/p : ํ•œ ๋ฌธ๋‹จ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

Proximity operators : ์ธ์ ‘์„ฑ์„ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ์—ฐ์‚ฐ ๊ธฐํ˜ธ

์—ฐ์‚ฐ ์ˆœ์„œ ์ตœ์ ํ™”

  1. ์•ž๋ถ€ํ„ฐ ์ฒ˜๋ฆฌ
  2. ์ˆœ์„œ ๋ฐ”๊ฟ”์„œ ์ฒ˜๋ฆฌ → ์‹œ๊ฐ„ ์ ˆ์•ฝ ๊ฐ€๋ŠฅOR ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ
    1. Doc Freq๊ฐ€ ์ž‘์€ ๊ฒƒ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•จ.
    2. Posting list size == Doc Freq
    3. ์—ฐ์‚ฐ ์‹œ ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ์ตœ๋Œ€๋กœ ๊ณ„์‚ฐํ•ด์„œ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ์ž‘์€ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•จ.
  3. AND ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ

์ธ์ ‘์กฐ๊ฑด

์งˆ์˜์–ด์˜ ๋‹จ์–ด๊ฐ€ ๋„์–ด์“ฐ๊ธฐ ๋‹จ์œ„๋กœ ๋Š์–ด์ง€๋ฉด ์•ˆ ๋˜๊ณ  ๋ถ™์–ด์žˆ์–ด์•ผ ํ•  ๋•Œ

Proximity ์ธ์ ‘์กฐ๊ฑด ๋ช…๋ น์–ด : NEAR

๊ตฌ์กฐํ™”๋œ TEXT๊นŒ์ง€ ์ƒ๊ฐํ•ด์„œ ๊ฒ€์ƒ‰ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ(์ €์ž์™€ ๋‚ด์šฉ ๊ตฌ์กฐ๊ฐ€ ๋งž๋Š” ๊ฒ€์ƒ‰)

Inverted Index + Position Information

Position Information : Posting์ด ํ•ด๋‹น ๋ฌธ์„œ๊ฐ€ ๊ทธ ๋ฌธ์„œ ๋‚ด์—์„œ ์–ด๋””์— ๋‚˜์™”๋Š”์ง€ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด list๋ฅผ ๊ฐ€๋ฆฌํ‚ด

Position Information Size = Term Frequency

๋‘ ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์ฐพ๋Š” ๋ฒ• : AND ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น Term ๋‚˜์™”๋Š”์ง€ ๊ฒ€์‚ฌ → ๋‘ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋น„๊ตํ•จ → ์ธ์ ‘ ์กฐ๊ฑด์— ๋งž์„ ๊ฒฝ์šฐ ๊ฒฐ๊ณผ์— ํฌํ•จ

์ธ์ ‘ ์ฐจ์ด : 1์ด๋ฉด ๋‘ ๋‹จ์–ด๊ฐ€ ๋ถ™์–ด์žˆ์Œ์„ ์˜๋ฏธํ•˜๊ณ , 2๋ฉด ๋‘ ๋‹จ์–ด ์‚ฌ์ด์— ํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ์˜๋ฏธํ•จ.

Term Freq ํ™œ์šฉ

Term Freq๊ฐ€ ํฐ ์ˆœ์œผ๋กœ ๊ฒฐ๊ณผ ์ œ๊ณตํ•˜๊ธฐ

์–ด๋–ค Term์ด ํ•˜๋‚˜์˜ ๋ฌธ์„œ์—์„œ Document์—์„œ ๋ช‡ ๋ฒˆ ๋‚˜์™”๋Š”์ง€ ๊ณ ๋ ค

์–ด๋–ค Posting์˜ Term Freq == ๊ทธ Posting์˜ Positional Information ๋ฆฌ์ŠคํŠธ์˜ ๊ธธ์ด

๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ, ๊ฐ ๋ฌธ์„œ์˜ Term Freq์˜ ํ•ฉ์ด ํฐ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•จ.

Ranking Search

Boolean ๋ชจ๋ธ์€ ํ•ด๋‹น ์งˆ์˜์–ด๊ฐ€ ๋ฌธ์„œ์— ํฌํ•จ ๋๋Š”์ง€ ์•ˆ ๋๋Š”์ง€๋งŒ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์— Ranking์„ ๋งค๊ธฐ๋Š” ๊ฑด ์›์น™์ ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ•˜์ง€๋งŒ, ๊ธฐ์ค€์„ ์ ์šฉํ•ด์„œ ์กฐ๊ธˆ ๋” ์œ ์šฉํ•  ๋งŒํ•œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Proximity ์ธ์ ‘์„ฑ์„ ์ด์šฉํ•ด์„œ ์ธ์ ‘ํ•  ์ˆ˜๋ก ์‚ฌ์šฉ์ž๊ฐ€ ๋” ์›ํ•˜๋Š” ์ •๋ณด์— ๋” ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€๊นŒ์šด ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ์˜ Term Freq๋ฅผ ์ด์šฉํ•ด์„œ ํฐ ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ๊ฐ€ ๋งŒ๋“ค์–ด์ง„ ์‹œ๊ฐ„์„ ์ธ๋ฑ์Šค์— ๋ถ€์—ฌํ•ด์„œ ์ตœ๊ทผ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜๋‹ค.

์ •๋ณด ๊ฒ€์ƒ‰ VS DB ๊ฒ€์ƒ‰

์ •๋ณด ๊ฒ€์ƒ‰

๋น„๊ตฌ์กฐํ™” ๋จ

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๋ถˆ๊ฐ€๋Šฅ

๋‹จ, TEXT์ด๋”๋ผ๊ณ  ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ. Semi-Structured Data ์˜ˆ) PPT, XML

๊ธ€์ž์˜ ํฌ๊ธฐ, ๊ธ€์ž์˜ ๊ตต๊ธฐ์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ.

DB ๊ฒ€์ƒ‰

๊ตฌ์กฐํ™”๋จ.

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ

๊ด€๋ จ ์žˆ๋Š” Concept ๊ฒ€์ƒ‰ : ์•„์ง ๋ถˆ๊ฐ€๋Šฅ

์šฉ์–ด

Clustering : ๊ตฐ์ง‘(๋ชจ์œผ๊ธฐ)

Classfication : ๋ถ„๋ฅ˜

์›น ๊ฒ€์ƒ‰

๋‹ค์–‘ํ•œ ๋ฌธ์„œ, ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ, ์งˆ์˜์–ด, ์ •๋ณด ์กด์žฌํ•จ.

๋งํฌ๋œ ์ •๋ณด ํ™œ์šฉํ•˜๊ฑฐ๋‚˜ ํด๋ฆญํ•œ ์ •๋ณด ํ™œ์šฉ ๊ฐ€๋Šฅํ•จ.

Cross-language information retrieval : ๊ต์ฐจ ์–ธ์–ด, ๋ฒˆ์—ญํ•ด์„œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ์ œ๊ณต

Question answering : ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ ์งˆ๋ฌธ&๋‹ต๋ณ€

Summarization : ๊ฒ€์ƒ‰๊ฒฐ๊ณผ ์š”์•ฝํ•ด์„œ ์ œ๊ณต

TEXT mining : TEXT์—์„œ ํ•„์š”ํ•œ ์ •๋ณด ๋ฝ‘์•„์„œ ์•Œ๋ ค์คŒ

1996๋…„) TEXT์™€ ๊ฐ™์ด ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ณด๋‹ค ์ปธ์Œ. ๊ทผ๋ฐ ์‹œ์žฅ์—์„œ๋Š” ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฑด ๋ˆ์ด ๋˜์ง€ ์•Š์•˜์Œ

2006๋…„) ์—ฌ์ „ํžˆ ๋น„๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ์˜ ๊ทœ๋ชจ๊ฐ€ ํฌ์ง€๋งŒ TEXT ๊ฒ€์ƒ‰ ์„œ๋น„์Šค ์‹œ์žฅ์ด ์ปค์ง€๋ฉด์„œ ๋น„๊ตฌ์กฐํ™” ๋ฌธ์„œ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹œ์žฅ์ด ์„ฑ์žฅํ•จ

์ •๋ณด๊ฒ€์ƒ‰์‹œ์Šคํ…œ์ด ์—†๋‹ค๋ฉด, ์œ ๋‹‰์Šค์˜ grep๋ช…๋ น์–ด, | ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ๋ฌธ์„œ์˜ ์–‘์ด ํฌ๋‹ค๋ฉด ์†๋„๊ฐ€ ๋А๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด ๊ฒ€์ƒ‰ ๋ชฉ์ ์œผ๋กœ๋Š” ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค. ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋Š” ๊ฑด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฉด ๋” ํƒ์ƒ‰ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋‹ˆ ์‰ฌ์šด ํŽธ์ด์ง€๋งŒ, ํŠน์ • ๋‚ด์šฉ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€ ๊ฑด ํŒŒ์ผ์„ ๋๊นŒ์ง€ ๋‹ค ํƒ์ƒ‰ํ•ด ๋ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋ ต๋‹ค. ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋Š” grep ๋ช…๋ น์œผ๋กœ ์ฐพ์„ ์ˆœ ์—†๋‹ค.

Boolean ๋ชจ๋ธ์€ ๋‹จ์–ด ํฌํ•จ ์—ฌ๋ถ€๋งŒ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค. grep ๋˜ํ•œ, Boolean ๋ชจ๋ธ์— ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ธด ํž˜๋“ค๋‹ค.

Boolean ๋ชจ๋ธ๋„ ์ ์ˆ˜๋ฅผ ๋งค๊ฒจ์„œ Sortํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜์ง€๋งŒ, ๊ธฐ๋ณธ์ ์œผ๋ก  ์ˆœ์œ„ ๋งค๊ธธ ์ˆ˜ ์—†๋‹ค.

$ grep -v <์งˆ์˜์–ด> * : ๋ชจ๋“  ๋ฌธ์„œ ํŒŒ์ผ(*)์—์„œ ์งˆ์˜์–ด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€(-v) ๋ฌธ์„œ๋ฅผ ํƒ์ƒ‰ํ•œ๋‹ค.

Term-document (incidence) matrix Term๊ณผ Document์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ

๊ฒ€์ƒ‰ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•ด, Term์ด Document์— ๋‚˜์˜ค๋ฉด 1, ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฉด 0

์งˆ์˜์–ด์™€ ์ผ์น˜ํ•˜๋Š” Term์˜ ํ–‰๋“ค์„ ์ฐพ์•„์„œ ์ฃผ์–ด์ง„ ์งˆ์˜์–ด์— ๋”ฐ๋ผ ํ–‰๋ผ๋ฆฌ AND, OR, NOT ์—ฐ์‚ฐ ์ˆ˜ํ–‰

์ด๋•Œ, NOT์€ 1๊ณผ 0์„ ๋งž๋ฐ”๊ฟˆ.

bitwise : ๊ฐ™์€ ์œ„์น˜์˜ bit๋ผ๋ฆฌ ์—ฐ์‚ฐ

์˜ˆ) ์งˆ์˜์–ด : Brutus AND Caesar but NOT Calpurnia

Brutus = 110100, Caesar = 110111, Calpurnia = 010000

์—ฐ์‚ฐ : 110100 AND 110111 AND 101111 = 100100

→ 1, 4๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ํฌํ•จ๋˜์ง€ ์•Š์Œ.

Term-document (incidence) matrix๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ๋‹ค๋ฉด, ์งˆ์˜์–ด ํ‚ค์›Œ๋“œ์— ํ•ด๋‹นํ•˜๋Š” Term์˜ ํ–‰์„ ์ฐพ์•„์„œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด๋‚ผ ์ˆ˜ ์žˆ์Œ. (Boolean ๊ฒ€์‚ฌ ๊ธฐ๋ณธ ๋ฐฉ๋ฒ•๋ก )

Boolean Model ๊ธฐ๋ณธ ๊ฐ€์ • : ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ์€ ๊ณ ์ •๋˜์–ด ์žˆ์Œ.

๋ง๋ญ‰์น˜ : Corpus, ๋ฌธ์„œ์˜ ์ง‘ํ•ฉ

Refine : ์‚ฌ๋žŒ์ด๋‚˜ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ์ด ์งˆ์˜์–ด๋ฅผ ๋” ์ •๊ตํ•˜๊ฒŒ ์ˆ˜์ •ํ•ด์„œ ์žฌ๊ฒ€์ƒ‰์„ ํ•˜๋Š” ํ–‰์œ„

๊ฒ€์ƒ‰๋„ ์‹œ์Šคํ…œ ํ‰๊ฐ€ ๊ธฐ์ค€

Precision ์ •๋ฐ€๋„ : ๊ฒฐ๊ณผ๊ฐ€ ์งˆ์˜์–ด์™€ ์–ผ๋งˆ๋‚˜ ๋ถ€ํ•ฉํ•˜๋Š”์ง€

๋ชจ๋ธ์ด True๋ผ๊ณ  ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ ์ค‘์—์„œ ์‹ค์ œ True์ธ ๊ฒƒ์˜ ๋น„์œจ(=์ •๋‹ต๋ฅ )

์‹ค์ œ๋กœ ์‚ฌ์šฉ์ž๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ / ๋ชจ๋ธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ

์ •๋‹ต / ๋‚ด๊ฐ€

Recall ์žฌํ˜„์œจ : ์งˆ์˜์–ด์— ๋ถ€ํ•ฉํ•˜๋Š” ๋ฌธ์„œ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฐพ์•„๋‚ด๋Š”์ง€

์‹ค์ œ True์ธ ๊ฒƒ ์ค‘์—์„œ ๋ชจ๋ธ์ด True๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์˜ ๋น„์œจ(=์ œ๊ณต๋ฅ )

์ •๋‹ต / ์‹ค์ œ

์ •ํ™•๋„(Accuracy)๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ํ‰๊ฐ€์— ์“ฐ์ด์ง€ ์•Š์Œ

์ •๋ฐ€๋„๊ฐ€ ๋†’์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋‚ฎ๊ณ , ์ •๋ฐ€๋„๊ฐ€ ๋‚ฎ์œผ๋ฉด ์žฌํ˜„์œจ์ด ๋†’์Œ.

์ •๋ฐ€๋„๊ฐ€ 100์ธ ๊ฒฝ์šฐ : ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ฌธ์„œ 100๊ฐœ์—์„œ 1๊ฐœ ์ฐพ์Œ. → ์žฌํ˜„์œจ์€ 0.01

์žฌํ˜„์œจ์ด 100์ธ ๊ฒฝ์šฐ : ์ฐพ์€ ๋ฌธ์„œ 100๊ฐœ ์ค‘ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๋‚ด์šฉ์ด ๋‹ด๊ธด ๋ชจ๋“  ๋ฌธ์„œ๊ฐ€ 1๊ฐœ์ž„ → ์ •๋ฐ€๋„๋Š” 0.01

**F-measure : ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์— ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ  ๊ตฌํ•œ ํ‰๊ท **

→ ํ”ํžˆ ์•ŒํŒŒ(๊ฐ€์ค‘์น˜)๋ฅผ 0.5๋กœ ์ฃผ์–ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ฆ

F-measure๋กœ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•จ.

์˜ˆ์‹œ)

Q1. ๊ฒ€์ƒ‰ ๋Œ€์ƒ ๋ฌธ์„œ 100๋งŒ ๊ฐœ, ๊ฐ ๋ฌธ์„œ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์žˆ์Œ. ํ•œ ๋‹จ์–ด๋Š” ํ‰๊ท ์ ์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ, ๋ถ€ํ˜ธ ํฌํ•จ 6bytes(6๊ธ€์ž). ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ๋Š”?

A1. ๋ฌธ์„œ ์ „์ฒด ํฌ๊ธฐ = 100๋งŒ * (1000 * 6bytes) = 6,000,000,000 = 60์–ตbytes = 6GB

Q2. 10์–ต(100๋งŒ * 1000 = 1,000,000,000)๊ฐœ ์˜ ๋‹จ์–ด ์ค‘, 500,000๊ฐœ์˜ ๋‹ค๋ฅธ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค. Term-Document Matrix์˜ ํฌ๊ธฐ๋Š”?

A2.

Term ๊ฐœ์ˆ˜ = 50๋งŒ

Document ๊ฐœ์ˆ˜ = 100๋งŒ

Term-Document Matrix ํฌ๊ธฐ = 50๋งŒ * 100๋งŒ = 5์ฒœ ์–ต

→ Boolean ๊ฒ€์ƒ‰์„ ์œ„ํ•ด Term-Document Matrix๋ฅผ ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด์•ผ ํ•˜๋Š”๋ฐ, ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํผ.

5์ฒœ ์–ต๊ฐœ์˜ 0๊ณผ 1์ค‘ 0์˜ ๋น„์ค‘์ด ๋ณดํ†ต 99.8%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1์€ 0.2%๋ฅผ ์ฐจ์ง€ํ•˜์—ฌ 1000๊ฐœ๋กœ ๋งŽ์ง€ ์•Š์Œ.

๋”ฐ๋ผ์„œ, 1์ธ ๊ฒƒ๋งŒ ๋”ฐ๋กœ ํ‘œ์‹œํ•˜๋„๋ก ํ•จ. ⇒ Inverted Index (Inverted file)

0์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ

term(ํ–‰) = 50๋งŒ๊ฐœ, document(์—ด) = 100๋งŒ๊ฐœ

ํ•˜๋‚˜์˜ ๋ฌธ์„œ์˜ term ์ •๋ณด = ํ•œ ์—ด(50๋งŒ๊ฐœ์˜ term), ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋Š” 1000๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋จ ⇒ ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— term ์ค‘ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ๋‚˜์˜ค๋ฏ€๋กœ 50๋งŒ ๊ฐœ์˜ ํ•œ ์—ด์—์„œ 1์€ ํ‰๊ท ์ ์œผ๋กœ 1000๊ฐœ์ž„. 49๋งŒ9์ฒœ๊ฐœ๋Š” 0์ž„. ⇒ 50๋งŒ ๊ฐœ์—์„œ 1000๊ฐœ๊ฐ€ 1์ด์—ˆ์œผ๋‹ˆ 100๋งŒ๊ฐœ์—์„œ๋Š” 2000๊ฐœ๊ฐ€ 1์ž„. ⇒ 2์ฒœ/100๋งŒ = 0.002. 0.2%๊ฐ€ 1์ด๊ณ , 99.8%๊ฐ€ 0์ž„.

Inverted index (inverted file)

Term-Document Matrix์˜ ๊ณต๊ฐ„ ์ƒ์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด 1์ธ ๋ฌธ์„œ๋งŒ ๋ชฉ๋กํ™”ํ•จ.

๊ฒ€์ƒ‰ ๋Œ€์ƒ(๋ฌธ์„œ, ์›น, PC๊ฒฝ๋กœ)์— ๋”ฐ๋ผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— Posting list์—๋Š” ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฌธ์„œ์˜ ๋ฒˆํ˜ธ ์‹๋ณ„์ž๋ฅผ ๋ถ€์—ฌํ•จ.

Posting : ๋ฌธ์„œ ๋ฒˆํ˜ธ ์‹๋ณ„์ž ํ•˜๋‚˜ํ•˜๋‚˜

Posting list : Posting์ด ๋‚˜์—ด๋œ ํ•˜๋‚˜์˜ ๋ฐฐ์—ด(๋ฆฌ์ŠคํŠธ)

Posting lists : Posting list์˜ ์ง‘ํ•ฉ

์งˆ์˜์–ด๋ฅผ ์ฐพ์„ ๋•Œ ๋” ๋นจ๋ฆฌ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ๋ฌธ์„œ๋ฅผ ์ •๋ ฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Posting list๋Š” ๋ฐฐ์—ด๋ณด๋‹ค ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ๋ฅผ ์„ ํ˜ธํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํฌ์ธํ„ฐ ๊ณต๊ฐ„์„ ๋” ์ฐจ์ง€ํ•œ๋‹ค.

Dictionary == Vocabulary (Term์˜ ์ง‘ํ•ฉ)

Inverted Index ๋งŒ๋“œ๋Š” ๊ณผ์ •

[๋‹จ๊ณ„]

  1. Tokenization (ํ† ํฐํ™”)→ ๋งŒ์•ฝ, ๋ฌธ์„œ๊ฐ€ TXT๊ฐ€ ์•„๋‹Œ ์•„๋ž˜ํ•œ๊ธ€, html์ด๋ผ๋ฉด? → ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ํ† ํฐํ™” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ.1-2. Linguistic Modules๋ฅผ ํ†ตํ•ด ์–ธ์–ด์— ๋”ฐ๋ผ ์ฒ˜๋ฆฌํ•จ. (์˜์–ด : ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ, ์›ํ˜•์œผ๋กœ ๋งŒ๋“ฆ)์–ด๋А ํ† ํฐ์ด ์–ด๋А ๋ฌธ์„œ์—์„œ ๋‚˜์™”๋Š”์ง€ Postings list(ํ•ด๋‹น ํ† ํฐ์˜ ๋ฌธ์„œ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„) ๊ตฌ์ถ•1-3-1. (Term, DocID) ์Œ์œผ๋กœ ๋งŒ๋“ฆ.1-3-3. ํ•„์š” ์‹œ, TF ์ •๋ณด ์ถ”๊ฐ€Term Frequency(TF) : ํ•œ ๋ฌธ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ Term์ด ๋‚˜์˜จ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
        Document Frequency(DF) : Term์ด ๋‚˜์˜จ ๋ฌธ์„œ์˜ ๊ฐœ์ˆ˜ ์ถ”๊ฐ€
    
        ์ด๋•Œ Term Frequency๋ผ๋ฆฌ ํ•ฉ์„ ๊ตฌํ•จ.
    
    DF๋กœ ํ•ฉ์ณ์ง„ ์ค‘๋ณต Term์€ ๊ฐ๊ฐ ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ์‹์œผ๋กœ ์ž์‹ ์˜ (DocID, TF)๋ฅผ ๊ฐ€๋ฆฌํ‚ด
  2. → ๋‚˜์ค‘์— ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ ํ˜•์‹์˜ Inverted Index์˜ ๊ธฐํ‹€์ด ๋จ.
  3. 1-3-5. ํ•„์š” ์‹œ, DF ์ •๋ณด ์ถ”๊ฐ€
  4. ํ•œ ๋ฌธ์„œ์˜ ์ค‘๋ณต Term์€ ํ•˜๋‚˜๋กœ ํ†ต์ผ.
  5. 1-3-2. Term์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•จ.
  6. Inverted Index๋ฅผ ๋งŒ๋“ฆ.
  7. 1-3. Indexer : ํ† ํฐ์„ ์ธ๋ฑ์Šค๋กœ ๋งŒ๋“ฆ.
  8. Tokenizer ๋ชจ๋“ˆ์„ ํ†ตํ•ด ํ† ํฐํ™” ํ•จ.
  9. 1-1. ํ† ํฐํ™” : ๋ฌธ์„œ๋“ค์˜ ๋ฌธ์žฅ์„ ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ์ž๋ฆ„

์‚ฌ๋žŒ์ด ์“ฐ๋Š” ๋‹จ์–ด๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ Dictionary size๋Š” Postings list์— ๋น„ํ•ด ๋ณ€ํ™”๊ฐ€ ์ ์Œ.

๋”ฐ๋ผ์„œ, Dictionary๋Š” ๊ฒ€์ƒ‰ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•ด Main Memory๋กœ load ๋จ.

Postings๋Š” ํฌ๊ธฐ๊ฐ€ ์ ์  ์ปค์ง€๊ณ  ๋ณ€ํ™”๋„ ์žฆ์•„ Disk์— ์ €์žฅ๋จ.

Postings list ์ €์žฅ ๋ฐฉ๋ฒ•

  1. ์—ฐ๊ฒฐ ๋ฆฌ์ŠคํŠธ
  2. ๊ฐ€๋ณ€ ๋ฐฐ์—ด
  3. hybrid scheme : ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ + ๊ฐ€๋ณ€ ๋ฐฐ์—ด (๋ฐฐ์—ด์„ ์—ฐ๊ฒฐ๋ฆฌ์ŠคํŠธ๋กœ ์—ฐ๊ฒฐํ•จ.)

์งˆ์˜์–ด

  1. Conjunctive : AND์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉif not sorting ์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ๊ณฑ
  2. ์กฐ๊ฑด : Posting list๊ฐ€ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด์•ผ ํ•จ.
  3. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ค๋ฅด๋ฉด ์ž‘์€ ์ชฝ์˜ Posting์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•˜๊ณ  ๋‹ค์‹œ ๋น„๊ต
  4. Disjunctive : OR์‹œ๊ฐ„๋ณต์žก๋„ : ๋น„๊ตํ•˜๋Š” Posting List ๊ธธ์ด์˜ ํ•ฉ
  5. Posting list๋ผ๋ฆฌ ์ฒซ ๋ฒˆ์งธ ์ธ๋ฑ์Šค๋ถ€ํ„ฐ ๋น„๊ตํ•ด์„œ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๊ฐ™์œผ๋ฉด ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ , ๋‹ฌ๋ผ๋„ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ ๋” ์ž‘์€ ์ชฝ์„ ๋จผ์ € ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จํ•˜๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  6. NOT → ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผNOT์ด ๋ถ™์€ Posting list๋Š” ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, ์ž์‹ ์—๊ฒŒ ์—†๋Š” Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅธ Posting list์™€ ์—ฐ์‚ฐํ•จ.NOT์ด ๋ถ™์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ๊ฐ€ NOT์ด ๋ถ™์ง€ ์•Š์€ Posting์˜ ๋ฌธ์„œ ๋ฒˆํ˜ธ์™€ ๋‹ค๋ฅด๋ฉด, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ์ž‘์€ ์ชฝ์ด ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค. ๊ฐ™์œผ๋ฉด ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‘ Posting ๋ชจ๋‘ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™ํ•œ๋‹ค.
  7. <AND>
  8. NOT์ด ๋ถ™์€ ๋‹จ์–ด๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ํฌํ•จ๋˜๋ฉด ์•ˆ ๋จ.

WestLaw ์‹œ์Šคํ…œ

์งˆ์˜์–ด๋ฅผ ๋งŒ๋“ค ๋•Œ, AND, OR, NOT + /์ˆซ์ž, ! ๊ฐ€๋Šฅํ•จ.

๋„์–ด์“ฐ๊ธฐ : OR

/<์ˆซ์ž> : <์ˆซ์ž> ๋‹จ์–ด ์ด๋‚ด์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ๋•Œ, AND

! : Wild-card ๋ฌธ์ž *

/s : ํ•œ ๋ฌธ์žฅ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

/p : ํ•œ ๋ฌธ๋‹จ ์•ˆ์— ๋‘ ๋‹จ์–ด๊ฐ€ ๋‘˜ ๋‹ค ๋‚˜์™€์•ผ ํ•จ. AND

Proximity operators : ์ธ์ ‘์„ฑ์„ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ์—ฐ์‚ฐ ๊ธฐํ˜ธ

์—ฐ์‚ฐ ์ˆœ์„œ ์ตœ์ ํ™”

  1. ์•ž๋ถ€ํ„ฐ ์ฒ˜๋ฆฌ
  2. ์ˆœ์„œ ๋ฐ”๊ฟ”์„œ ์ฒ˜๋ฆฌ → ์‹œ๊ฐ„ ์ ˆ์•ฝ ๊ฐ€๋ŠฅOR ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ
    1. Doc Freq๊ฐ€ ์ž‘์€ ๊ฒƒ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•จ.
    2. Posting list size == Doc Freq
    3. ์—ฐ์‚ฐ ์‹œ ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ์ตœ๋Œ€๋กœ ๊ณ„์‚ฐํ•ด์„œ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ์ž‘์€ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•จ.
  3. AND ์‹œ๊ฐ„ ๋ณต์žก๋„ : Posting list size์˜ ํ•ฉ

์ธ์ ‘์กฐ๊ฑด

์งˆ์˜์–ด์˜ ๋‹จ์–ด๊ฐ€ ๋„์–ด์“ฐ๊ธฐ ๋‹จ์œ„๋กœ ๋Š์–ด์ง€๋ฉด ์•ˆ ๋˜๊ณ  ๋ถ™์–ด์žˆ์–ด์•ผ ํ•  ๋•Œ

Proximity ์ธ์ ‘์กฐ๊ฑด ๋ช…๋ น์–ด : NEAR

๊ตฌ์กฐํ™”๋œ TEXT๊นŒ์ง€ ์ƒ๊ฐํ•ด์„œ ๊ฒ€์ƒ‰ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ(์ €์ž์™€ ๋‚ด์šฉ ๊ตฌ์กฐ๊ฐ€ ๋งž๋Š” ๊ฒ€์ƒ‰)

Inverted Index + Position Information

Position Information : Posting์ด ํ•ด๋‹น ๋ฌธ์„œ๊ฐ€ ๊ทธ ๋ฌธ์„œ ๋‚ด์—์„œ ์–ด๋””์— ๋‚˜์™”๋Š”์ง€ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด list๋ฅผ ๊ฐ€๋ฆฌํ‚ด

Position Information Size = Term Frequency

๋‘ ๋‹จ์–ด๊ฐ€ ์ธ์ ‘ํ•ด ์žˆ๋Š”์ง€ ์ฐพ๋Š” ๋ฒ• : AND ์—ฐ์‚ฐ์œผ๋กœ ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น Term ๋‚˜์™”๋Š”์ง€ ๊ฒ€์‚ฌ → ๋‘ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋น„๊ตํ•จ → ์ธ์ ‘ ์กฐ๊ฑด์— ๋งž์„ ๊ฒฝ์šฐ ๊ฒฐ๊ณผ์— ํฌํ•จ

์ธ์ ‘ ์ฐจ์ด : 1์ด๋ฉด ๋‘ ๋‹จ์–ด๊ฐ€ ๋ถ™์–ด์žˆ์Œ์„ ์˜๋ฏธํ•˜๊ณ , 2๋ฉด ๋‘ ๋‹จ์–ด ์‚ฌ์ด์— ํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ์˜๋ฏธํ•จ.

Term Freq ํ™œ์šฉ

Term Freq๊ฐ€ ํฐ ์ˆœ์œผ๋กœ ๊ฒฐ๊ณผ ์ œ๊ณตํ•˜๊ธฐ

์–ด๋–ค Term์ด ํ•˜๋‚˜์˜ ๋ฌธ์„œ์—์„œ Document์—์„œ ๋ช‡ ๋ฒˆ ๋‚˜์™”๋Š”์ง€ ๊ณ ๋ ค

์–ด๋–ค Posting์˜ Term Freq == ๊ทธ Posting์˜ Positional Information ๋ฆฌ์ŠคํŠธ์˜ ๊ธธ์ด

๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ, ๊ฐ ๋ฌธ์„œ์˜ Term Freq์˜ ํ•ฉ์ด ํฐ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•จ.

Ranking Search

Boolean ๋ชจ๋ธ์€ ํ•ด๋‹น ์งˆ์˜์–ด๊ฐ€ ๋ฌธ์„œ์— ํฌํ•จ ๋๋Š”์ง€ ์•ˆ ๋๋Š”์ง€๋งŒ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์— Ranking์„ ๋งค๊ธฐ๋Š” ๊ฑด ์›์น™์ ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ•˜์ง€๋งŒ, ๊ธฐ์ค€์„ ์ ์šฉํ•ด์„œ ์กฐ๊ธˆ ๋” ์œ ์šฉํ•  ๋งŒํ•œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Proximity ์ธ์ ‘์„ฑ์„ ์ด์šฉํ•ด์„œ ์ธ์ ‘ํ•  ์ˆ˜๋ก ์‚ฌ์šฉ์ž๊ฐ€ ๋” ์›ํ•˜๋Š” ์ •๋ณด์— ๋” ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€๊นŒ์šด ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ์˜ Term Freq๋ฅผ ์ด์šฉํ•ด์„œ ํฐ ์ˆœ์œผ๋กœ ์ œ๊ณตํ•˜๊ฑฐ๋‚˜, ๋ฌธ์„œ๊ฐ€ ๋งŒ๋“ค์–ด์ง„ ์‹œ๊ฐ„์„ ์ธ๋ฑ์Šค์— ๋ถ€์—ฌํ•ด์„œ ์ตœ๊ทผ ๋ฌธ์„œ๋ถ€ํ„ฐ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ธด ํ•˜๋‹ค.

์ •๋ณด ๊ฒ€์ƒ‰ VS DB ๊ฒ€์ƒ‰

์ •๋ณด ๊ฒ€์ƒ‰

๋น„๊ตฌ์กฐํ™” ๋จ

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๋ถˆ๊ฐ€๋Šฅ

๋‹จ, TEXT์ด๋”๋ผ๊ณ  ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ. Semi-Structured Data ์˜ˆ) PPT, XML

๊ธ€์ž์˜ ํฌ๊ธฐ, ๊ธ€์ž์˜ ๊ตต๊ธฐ์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ์•ฝ๊ฐ„์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ.

DB ๊ฒ€์ƒ‰

๊ตฌ์กฐํ™”๋จ.

๋ฒ”์œ„ ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ

๊ด€๋ จ ์žˆ๋Š” Concept ๊ฒ€์ƒ‰ : ์•„์ง ๋ถˆ๊ฐ€๋Šฅ

์šฉ์–ด

Clustering : ๊ตฐ์ง‘(๋ชจ์œผ๊ธฐ)

Classfication : ๋ถ„๋ฅ˜

์›น ๊ฒ€์ƒ‰

๋‹ค์–‘ํ•œ ๋ฌธ์„œ, ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ, ์งˆ์˜์–ด, ์ •๋ณด ์กด์žฌํ•จ.

๋งํฌ๋œ ์ •๋ณด ํ™œ์šฉํ•˜๊ฑฐ๋‚˜ ํด๋ฆญํ•œ ์ •๋ณด ํ™œ์šฉ ๊ฐ€๋Šฅํ•จ.

Cross-language information retrieval : ๊ต์ฐจ ์–ธ์–ด, ๋ฒˆ์—ญํ•ด์„œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ์ œ๊ณต

Question answering : ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ ์งˆ๋ฌธ&๋‹ต๋ณ€

Summarization : ๊ฒ€์ƒ‰๊ฒฐ๊ณผ ์š”์•ฝํ•ด์„œ ์ œ๊ณต

TEXT mining : TEXT์—์„œ ํ•„์š”ํ•œ ์ •๋ณด ๋ฝ‘์•„์„œ ์•Œ๋ ค์คŒ