Programmer's Haven
« Need help - 23 Gig File, Indexing it... »

Topic: Need help - 23 Gig File, Indexing it... (Read 162 times)
Garrett
Administrator
I shall conquor the worl..... Ooooo, pretty blinking LED lights!


Joined: Dec 2006
Gender: Male
Posts: 623
Location: In my mind
 Re: Need help - 23 Gig File, Indexing it...
« Reply #15 on Sept 15, 2009, 9:57pm »

Well, with a different coding approach I was able to cut indexing from 3 days down to 10 hours. Still not good enough; I want around 3 hours total. So I'm still toying with ideas here.


Quote:
How about dtSearch at http://www.dtsearch.com/


Ooooo.. Pricey!!!

'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

Some software and links: [Parabolic Logic]
My blog, or, My Rants: [MSN Live Spaces]
Michael
Uber Yapper
*****

Joined: Oct 2007
Gender: Male
Posts: 974
 Re: Need help - 23 Gig File, Indexing it...
« Reply #16 on Sept 17, 2009, 6:43pm »

> Unfortunately wp2txt doesn't split article by article, but merely by the number of bytes you specify.

Hmm... not good. Let me stew on this awhile, Mr. G. Maybe I can come up with something in the way of a sed script and regex that might help.

Link Mojo: A Link Exchange For Small Software Developers
Garrett
 Re: Need help - 23 Gig File, Indexing it...
« Reply #17 on Sept 18, 2009, 1:34am »

Well, we've made some headway on the Rev forums on this too. The speed issue is pretty much resolved, but we've run into one little issue during the indexing routine: somehow, at some point, our character offset becomes corrupt.

But I think someone posted a possible fix earlier today also. Too tired to check on it now though.

I think we're almost there!

I sure hope so. I've only found 3 offline Wikipedia viewers. Of the 3, only 2 can view the entire database; the other contains only a fixed set of articles (under 2,000, when there are over 3 million in the full database). The 2 that handle the entire db are good, but their viewers suffer from formatting issues: one has trouble with tables, and the other leaves extraneous data in the article, which makes it nearly impossible to focus on the text.
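
On the corrupted character offset: the thread never says what the cause turned out to be. One common way offsets drift in a large UTF-8 file, offered here purely as an assumption and not as a diagnosis of the Rev code, is counting characters while seeking by bytes. A small Python sketch with made-up sample text:

```python
# Hypothetical sample: the first line contains multi-byte UTF-8 characters,
# as Wikipedia article text often does.
text = "Zürich café\n<title>Target</title>\n"
data = text.encode("utf-8")

# The same position, counted in characters and counted in bytes.
char_offset = text.index("<title>")
byte_offset = data.index(b"<title>")
print(char_offset, byte_offset)

# Seeking in the byte stream with the character offset lands too early.
# The gap grows with every multi-byte character that precedes the target.
print(data[char_offset:char_offset + 7])
print(data[byte_offset:byte_offset + 7])
```

If something like this is in play, keeping every offset in bytes (reading the file in binary mode) would avoid the drift.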
fieldens
Full Member
***

Joined: Nov 2007
Gender: Male
Posts: 110
Location: Marshall
 Re: Need help - 23 Gig File, Indexing it...
« Reply #18 on Sept 18, 2009, 8:45am »

Hello Garrett,

I don't know if you can use something like this or not. I got it from the PB forum a few years ago. It's an INSTR replacement that's supposed to be faster than the built-in INSTR.

It was faster than PB's INSTR, but it may not be in the software you use.

I'll post it just in case you are using INSTR and can use something like this.


Code:



'----------------------------------------------------------------------------
' INSTR replacement, very fast when doing repeated search for long strings
' in long strings. Original code by Steve Hutchesson.
' Adjusted to work as direct replacement for INSTR by Borje Hagsten.
' No negative startpos for backwards search though, sorry.
'
' startpos can range from 1 to LEN(MainStr) - LEN(SearchStr).
' Returned position is 1-based, just like INSTR. Return is zero if not found.
'------------------------------------------------------------------------------

FUNCTION bmINSTR(BYVAL startpos AS LONG, m AS STRING, s AS STRING) AS LONG
#REGISTER NONE
LOCAL eLen AS LONG, cval AS LONG
LOCAL lpSource AS LONG, lnSource AS LONG, lpSearch AS LONG, lnSearch AS LONG
LOCAL shift_table AS STRING * 1024

! mov esi, s ; store Search string ptr and len
! mov esi, [esi]
! cmp esi, 0 ; if esi is zero - no search string
! je No_Match2 ; then return zero (eax is not set yet)
! mov lpSearch, esi
! mov esi, [esi-4]
! mov lnSearch, esi

! cmp lnSearch, 3 ; check search len
! jb ShortWordScan ; if shorter than 3, use INSTR instead

! mov esi, m ; store Main string ptr and len
! mov esi, [esi]
! cmp esi, 0 ; if esi is zero - no main string
! je No_Match2 ; then return zero (eax is not set yet)
! mov lpSource, esi
! mov esi, [esi-4]
! mov lnSource, esi

! cmp startpos, 0 ; check startpos
! je OKsize2 ; if zero, ok
! dec startpos ; else assume > 0 - decrease, since bmINSTR is 0-based

OKsize2:
! mov esi, lpSource
! add esi, lnSource
! sub esi, lnSearch
! mov eLen, esi ; set Exit Length

' ----------------------------------------
' load shift table with value in lnSearch
' ----------------------------------------
! mov ecx, 256
! mov eax, lnSearch
! lea edi, shift_table
! rep stosd

' ----------------------------------------------
' load descending count values into shift table
' ----------------------------------------------
! mov ecx, lnSearch ; SubString length in ECX
! dec ecx ; correct for zero based index
! mov esi, lpSearch ; address of SubString in ESI
! lea edi, shift_table
! xor eax, eax

Write_Shift_Chars2:
! mov al, [esi] ; get the character
! inc esi ; next one
! mov [edi+eax*4], ecx ; write shift for each character
! dec ecx ; to ascii location in table
! jnz Write_Shift_Chars2

' -----------------------------
' set up for main compare loop
' -----------------------------
! mov ecx, lnSearch
! dec ecx
! mov cval, ecx

! mov esi, lpSource
! mov edi, lpSearch
! add esi, startpos ; add starting position

! mov edx, lnSearch ; pattern length in edx
! jmp Pre_Cmp2 ; enter via the bounds check, so a main string shorter than the search string exits cleanly

'*********************** Loop Code ***************************
Set_Suffix_Shift2:
! add eax, ecx ; add CMP count
! sub eax, cval ; sub loop count
! cmp eax, 0 ; test eax for zero
! jg Add_Suffix_Shift2
! mov eax, 1 ; minimum shift is 1

Add_Suffix_Shift2:
! add esi, eax ; add suffix shift

Pre_Cmp2:
! cmp esi, eLen ; test exit length
! ja No_Match2
! xor eax, eax ; reset EAX
! mov ecx, cval ; reset counter in compare loop

Cmp_Loop2:
! mov al, [esi+ecx]
! cmp al, [edi+ecx] ; cmp characters in ESI / EDI
! jne Get_Shift2 ; if not equal, get next shift
! dec ecx
! jns Cmp_Loop2

! jmp Match2

Get_Shift2:
! mov eax, shift_table[eax*4] ; get char shift value
! cmp eax, edx ; is eax pattern length ?
! jne Set_Suffix_Shift2 ; if not, jump to Set_Suffix_Shift2
! lea esi, [esi+ecx+1] ; add bad char shift
! jmp Pre_Cmp2
' ***************************************************************

Match2:
! sub esi, lpSource ; sub source from ESI
! mov eax, esi ; put length in eax
! inc eax ; adjust for 1-based return
! jmp Cleanup2 ; exit

No_Match2:
! mov eax, 0 ; set value for no match

Cleanup2:
! mov FUNCTION, eax ; return 1-based position
EXIT FUNCTION

ShortWordScan: ' if search len is < 3
FUNCTION = INSTR(startpos, m, s)

END FUNCTION
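
For anyone who doesn't read x86 assembly, the shift-table idea behind the routine above can be sketched in Python. This is the simpler Horspool variant of Boyer-Moore, not a line-by-line port of the posted code, and the name `horspool_find` is made up for illustration. It keeps the same conventions: 1-based positions, zero when not found.

```python
def horspool_find(main: str, search: str, startpos: int = 1) -> int:
    """Return the 1-based position of search in main at or after
    startpos, or 0 if it is not found."""
    m, n = len(search), len(main)
    if m == 0 or n == 0:
        return 0

    # Shift table: how far the window may slide when its last character
    # is c. Characters that are not in the pattern (excluding its final
    # position) allow a slide of the full pattern length.
    shift = {}
    for i, c in enumerate(search[:-1]):
        shift[c] = m - 1 - i

    pos = max(startpos, 1) - 1  # convert to a 0-based window start
    while pos <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and main[pos + j] == search[j]:
            j -= 1
        if j < 0:
            return pos + 1  # full match, report 1-based position
        pos += shift.get(main[pos + m - 1], m)
    return 0


print(horspool_find("hello world", "world"))
print(horspool_find("abcabcabc", "abc", 2))
```

The payoff is the same as in the assembly version: with long search strings, most windows are rejected after one comparison and skipped by several characters at a time.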

Garrett
 Re: Need help - 23 Gig File, Indexing it...
« Reply #19 on Sept 18, 2009, 12:12pm »

I'm gonna keep that just in case we can't resolve the Rev code issues.

Thanks! :-)
Garrett
 Re: Need help - 23 Gig File, Indexing it...
« Reply #20 on Sept 19, 2009, 2:07pm »

~!HOLD THE PRESSES!~

I've received a lot of help from a lot of people over on the Rev forums for this project, and with their help, we've not only achieved my goal, but surpassed it!!!

My goal was to index the entire 23 gig file in 3 hours, which is about what the other programs like this take.

It took a mere 32 minutes total to completely index the entire 23 gig XML file. The end result is an index listing 9,013,937 entries, of which only around 3 million are actual articles. The rest are extraneous entries that serve as redirects to the articles or are image references or category references. In my final version I'll be implementing a filter that will weed out as much of the extraneous stuff as possible. The index file only weighs in at 312 MB.

I can't thank everyone over there enough for all the help they gave me. I can't even thank you guys enough for trying to help me out on this as well.

I'm just so excited at the end result that frankly I'm still a bit in shock at the final total time!

I'm not done with this project yet of course.. Still many things to do, but that was the biggest hurdle right there. I had already done much of the other work needed, such as search, grab data from the file, parse the data, display the data in a formatted view etc... But still more work to be done.

I hope this can become one of the more viable offline Wikipedia programs available today.

It'll be open source, so if anyone is interested in it at a later date, you can grab the source and do as you wish with it.

Thanks everyone for all your help! :-)
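
A minimal Python sketch of the kind of index described above: one pass over the dump that records the byte offset of each article title and skips redirects, images and categories. This is not the Rev code from the project. The function name `build_title_index`, the skipped prefixes and the tiny sample are made up, and it assumes each `<title>`, `<redirect` and `</page>` tag sits on its own line.

```python
import io


def build_title_index(stream, skip_prefixes=("File:", "Image:", "Category:")):
    """Map each kept article title to the byte offset of its <title> line.

    stream must be opened in binary mode so that offsets are in bytes.
    XML entities in titles (such as &amp;) are left undecoded.
    """
    index = {}
    offset = 0
    title, title_offset, is_redirect = None, 0, False
    for line in stream:
        s = line.strip()
        if s.startswith(b"<title>") and s.endswith(b"</title>"):
            title = s[len(b"<title>"):-len(b"</title>")].decode("utf-8")
            title_offset = offset
            is_redirect = False
        elif s.startswith(b"<redirect"):
            is_redirect = True
        elif s == b"</page>":
            # Commit the entry only once the whole page has been seen.
            if (title is not None and not is_redirect
                    and not title.startswith(skip_prefixes)):
                index[title] = title_offset
            title = None
        offset += len(line)
    return index


# Hypothetical three-page sample: one article, one redirect, one category.
sample = (
    b"<page>\n"
    b"  <title>Alpha</title>\n"
    b"  <text>First article</text>\n"
    b"</page>\n"
    b"<page>\n"
    b"  <title>Beta</title>\n"
    b"  <redirect />\n"
    b"</page>\n"
    b"<page>\n"
    b"  <title>Category:Things</title>\n"
    b"</page>\n"
)
index = build_title_index(io.BytesIO(sample))
print(index)
```

A viewer could then look up a title, seek straight to the stored offset and read forward to the closing `</page>` without scanning the rest of the file.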
Jerry Muelver
Retired Admin

Any questions?


Joined: Dec 2006
Gender: Male
Posts: 721
Location: Northwoods Wisconsin
 Re: Need help - 23 Gig File, Indexing it...
« Reply #21 on Sept 20, 2009, 5:12am »

Congrats, Garrett!

Now you can begin work on health care reform, world hunger, and global warming! All it takes is perseverance and creative thinking. Looks like you've got the right tools for the job. ::)

"Wiki" is the answer. What was the question, again?
North American Ido Society home page
Tweets at http://twitter.com/jmuelver
Michael
 Re: Need help - 23 Gig File, Indexing it...
« Reply #22 on Sept 20, 2009, 8:56am »

Kewl! 32 mins ain't too shabby at all.
Jerry Muelver
 Re: Need help - 23 Gig File, Indexing it...
« Reply #23 on Oct 28, 2009, 7:52pm »

Yo, Garrett! Is this you, again?
http://tech.yahoo.com/news/ap/20091028/a....test_wikireader

::)
Garrett
 Re: Need help - 23 Gig File, Indexing it...
« Reply #24 on Oct 28, 2009, 10:18pm »

Jerry.. that's a sweet little device! But I'd like it if they went with a color screen and tossed in the images that go with the articles. Maybe they'll do that in a year or two. :-)

BTW, I had to shelve my project because classes started last month and I wasn't done with it. I left off at formatting issues with articles that have errors in their formatting markers.

If anyone is interested in the sources for this, either to derive a project or to continue this one, feel free to PM me with an email address and I'll send a zip file (without the Wikipedia database; download that from their site instead).